Abstract

In most classical frameworks for learning from examples, it is assumed that examples are randomly drawn
and presented to the learner. In this paper, we consider the possibility of a more active learner who is
allowed to choose his/her own examples. Our investigations are carried out in a function approximation
setting. In particular, using arguments from optimal recovery (Micchelli and Rivlin, 1976), we develop an
adaptive sampling strategy (equivalent to adaptive approximation) for arbitrary approximation schemes.
We provide a general formulation of the problem and show how it can be regarded as sequential optimal
recovery. We demonstrate the application of this general formulation to two special cases of functions
on the real line: 1) monotonically increasing functions and 2) functions with bounded derivative. An
extensive investigation of the sample complexity of approximating these functions is conducted, yielding
both theoretical and empirical results on test functions. Our theoretical results (stated in PAC-style),
along with the simulations, demonstrate the superiority of our active scheme over both passive learning
and classical optimal recovery. The analysis of active function approximation is conducted in a
worst-case setting, in contrast with other Bayesian paradigms obtained from optimal design (Mackay,
1992).


Copyright © Massachusetts Institute of Technology, 1994


This  report  describes  research  done  at  the  Center  for  Biological  and  Computational  Learning  and  the  Artificial  Intelligence 
Laboratory  of  the  Massachusetts  Institute  of  Technology.  Support  for  the  Center  is  provided  in  part  by  a  grant  from  the 
National  Science  Foundation  under  contract  ASC-9217041. 




1  Introduction 

In the classical paradigm of learning from examples,
the data, or examples, are typically drawn according to
some fixed, unknown, arbitrary probability distribution.
This is the case for a wide class of problems in the
PAC (Valiant, 1984) framework, as well as other familiar
frameworks in the connectionist (Rumelhart et al., 1986)
and pattern recognition literature. In this important
sense, the learner is merely a passive recipient of
information about the target function. In this paper, we
consider the possibility of a more active learner. There are of
course a myriad of ways in which a learner could be more
active. Consider, for example, the extreme pathological
case where the learner simply asks for the true target
function, which is duly provided by an obliging oracle.
This, the reader will quickly realize, is hardly interesting.
Such pathological cases aside, this theme of activity on
the part of the learner has been explored (though it is
not always conceived as such) in a number of different
settings (PAC-style concept learning, boundary-hunting
pattern recognition schemes, adaptive integration,
optimal sampling, etc.) in more principled ways, and we will
comment on these in due course.

For our purposes, we restrict our attention in this
paper to the situation where the learner is allowed to choose
its own examples* for function approximation tasks. In
other words, the learner is allowed to decide where in the
domain $D$ (for functions defined from $D$ to $Y$) it would
like to sample the target function. Note that this is
in direct contrast to the passive case, where the learner
is presented with randomly drawn examples. Keeping
other factors in the learning paradigm unchanged, we
then compare, in this paper, the active and passive
learners, who differ only in their method of collecting
examples. At the outset, we are particularly interested in
whether there exist principled ways of collecting
examples in the first place. A second important consideration
is whether these ways allow the learner to learn with
fewer examples. This latter question is particularly
important, as one needs to assess the advantage,
from an information-theoretic point of view, of active
learning.

Are there principled ways to choose examples? We
develop a general framework for collecting (choosing)
examples for approximating (learning) real-valued
functions. This can be viewed as a sequential version of
optimal recovery (Micchelli and Rivlin, 1977): a scheme
for optimal sampling of functions. Such an active
learning scheme is consequently in a worst-case setting, in
contrast with other schemes (Mackay, 1992; Cohn, 1993;
Sollich, 1993) that operate within a Bayesian,
average-case paradigm. We then demonstrate the application of
sequential optimal recovery to some specific classes of
functions. We obtain theoretical bounds on the sample
complexity of the active and passive learners, and
perform some empirical simulations to demonstrate the
superiority of the active learner.

*This can be regarded as a computational instantiation
of the psychological practice of selective attention, where a
human might choose to selectively concentrate on interesting
or confusing regions of the feature space in order to better
grasp the underlying concept. Consider, for example, the
situation when one encounters a speaker with a foreign accent.
One cues in to this foreign speech by focusing on and then
adapting to its distinguishing properties. This is often
accomplished by asking the speaker to repeat words which are
confusing to us.

2  A  General  Framework  For  Active 
Approximation 

2.1  Preliminaries 

We  need  to  develop  the  following  notions: 

$\mathcal{F}$: Let $\mathcal{F}$ denote a class of functions from some domain
$D$ to $Y$, where $Y$ is a subset of the real line. The
domain $D$ is typically a subset of $R^k$, though it could be
more general than that. There is some unknown target
function $f \in \mathcal{F}$ which has to be approximated by an
approximation scheme.

$\mathcal{D}$: This is a data set obtained by sampling the target
$f \in \mathcal{F}$ at a number of points in its domain. Thus,

$$\mathcal{D} = \{(x_i, y_i) \mid x_i \in D,\ y_i = f(x_i),\ i = 1 \ldots n\}$$

Notice that the data is uncorrupted by noise.

$\mathcal{H}$: This is a class of functions (also from $D$ to $Y$) from
which the learner will choose one in an attempt to
approximate the target $f$. Notationally, we will use $\mathcal{H}$
to refer not merely to the class of functions (hypothesis
class) but also to the algorithm by means of which the
learner picks an approximating function $h \in \mathcal{H}$ on the
basis of the data set $\mathcal{D}$. In other words, $\mathcal{H}$ denotes an
approximation scheme which is really a tuple $\langle \mathcal{H}, A \rangle$.
$A$ is an algorithm that takes as its input the data set $\mathcal{D}$
and outputs an $h \in \mathcal{H}$.

Examples: If we consider real-valued functions from $R$
to $R$, some typical examples of $\mathcal{H}$ are the class of
polynomials of a fixed order (say $q$), splines of some fixed
order, radial basis functions with some bound on the
number of nodes, etc. As a concrete example, consider
functions from $[0, 1]$ to $R$. Imagine a data set is collected
which consists of examples, i.e., $(x_i, y_i)$ pairs as per our
notation. Without loss of generality, one could assume
that $x_i < x_{i+1}$ for each $i$. Then a cubic (degree-3) spline
is obtained by interpolating the data points by
polynomial pieces (with the pieces tied together at the data
points or "knots") such that the overall function is
twice-differentiable at the knots. Fig. 1 shows an example of
an arbitrary data set fitted by cubic splines.

Figure 1: An arbitrary data set fitted with cubic splines

$d_C$: We need a metric to determine how good the
learner's approximation is. Specifically, the
metric $d_C$ measures the approximation error on the
region $C$ of the domain $D$. In other words, $d_C$ takes as its
input any two functions (say $f_1$ and $f_2$) from $D$ to $Y$ and
outputs a real number. It is assumed that $d_C$ satisfies
all the requisites for being a real distance metric on the
appropriate space of functions. Since the approximation
error on a larger domain can be no smaller
than that on a smaller domain contained in it, we can make the
following two observations: 1) for any two sets $C_1$ and $C_2$ such
that $C_1 \subset C_2$, $d_{C_1}(f_1, f_2) \le d_{C_2}(f_1, f_2)$; 2) $d_D(f_1, f_2)$ is
the total approximation error on the entire domain; this is our
basic criterion for judging the "goodness" of the learner's
hypothesis.

Examples: For real-valued functions from $R^k$ to $R$, the
$L_p$ metric defined as $d_C(f_1, f_2) = (\int_C |f_1 - f_2|^p\,dx)^{1/p}$
serves as a natural example of an error metric.

$\mathcal{C}$: This is a collection of subsets $C$ of the domain. We are
assuming that points in the domain where the function is
sampled divide (partition) the domain into a collection
of disjoint sets $C_i \in \mathcal{C}$ such that $\cup_i C_i = D$.

Examples: For the case of functions from $[0, 1]$ to $R$,
and a data set $\mathcal{D}$, a natural way in which to partition the
domain $[0, 1]$ is into the intervals $[x_i, x_{i+1})$ (here again,
without loss of generality we have assumed that $x_i <
x_{i+1}$). The set $\mathcal{C}$ could be the set of all (closed, open, or
half-open and half-closed) intervals $[a, b] \subset [0, 1]$.

The goal of the learner (operating with an
approximation scheme $\mathcal{H}$) is to provide a hypothesis $h \in \mathcal{H}$ (which
it chooses on the basis of its example set $\mathcal{D}$) as an
approximator of the unknown target function $f \in \mathcal{F}$. We
now need to formally lay down a criterion for assessing
the competence of a learner (approximation scheme). In
recent times, there has been much use of PAC-like (Valiant,
1984) criteria to assess learning algorithms. Such a
criterion has been used largely for concept learning, but
some extensions to the case of real-valued functions
exist (Haussler, 1989). We adapt here for our purposes a
PAC-like criterion to judge the efficacy of approximation
schemes of the kind described earlier.

Definition 1 An approximation scheme is said to
$P$-PAC learn the function $f \in \mathcal{F}$ if for every $\epsilon > 0$ and
$1 > \delta > 0$, and for an arbitrary distribution $P$ on $D$,
it collects a data set $\mathcal{D}$ and computes a hypothesis $h \in
\mathcal{H}$ such that $d_D(h, f) < \epsilon$ with probability greater than
$1 - \delta$. The function class $\mathcal{F}$ is $P$-PAC learnable if the
approximation scheme can $P$-PAC learn every function
in $\mathcal{F}$. The class $\mathcal{F}$ is PAC learnable if the approximation
scheme can $P$-PAC learn the class for every distribution
$P$.


There is an important clarification to be made about
our definition above. Note that the distance metric $d$
is arbitrary. It need not be naturally related to the
distribution $P$ according to which the data is drawn.
Recall that this is not so in typical distance metrics
used in classical PAC formulations. For example, in
concept learning, where the set $\mathcal{F}$ consists of indicator
functions, the metric used is the $L_1(P)$ metric given by
$d(1_A, 1_B) = \int_D |1_A - 1_B| P(x)\,dx$. Similarly, extensions
to real-valued functions typically use an $L_2(P)$ metric.
The use of such metrics implies that the training error is
an empirical average of the true underlying error. One
can then make use of convergence of empirical means to
true means (Vapnik, 1982) and prove learnability. In our
setting, this need not hold. For example, one
could always come up with a distribution $P$ which would
never allow a passive learner to see examples in a certain
region of the domain. However, the arbitrary metric $d$
might weigh this region heavily. Thus the learner would
never be able to learn such a function class for this
metric. In this sense, our model is more demanding than
classical PAC. To make matters easy, we will consider
here the case of $P$-PAC learnability alone, where $P$
is a known distribution (uniform in the example cases
studied). However, there is a sense in which our notion
of PAC is easier: the learner knows the true metric $d$
and, given any two functions, can compute their
relative distance. This is not so in classical PAC, where the
learner cannot compute the distance between two
functions since it does not know the underlying distribution.

We have left the mechanism of data collection
undefined. Our goal here is the investigation of different
methods of data collection. A baseline against which
we will compare all such schemes is the passive method
of data collection, where the learner collects its data set
by sampling $D$ according to $P$ and receiving the point
$(x, f(x))$. If the learner were allowed to draw its own
examples, are there principled ways in which it could do
this? Further, as a consequence of this flexibility
accorded to the learner in its data gathering scheme, could
it learn the class $\mathcal{F}$ with fewer examples? These are the
questions we attempt to resolve in this paper, and we
begin by motivating and deriving, in the next section, a
general framework for active selection of data for
arbitrary approximation schemes.
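
The passive baseline and the error metric described above can be sketched in a few lines. This is only an illustration under stated assumptions: the target $f(x) = x^2$, the uniform distribution on $[0, 1]$, a piecewise-linear interpolant standing in for the approximation scheme (not the cubic splines of Fig. 1), and a midpoint-rule estimate of the $L_p$ metric are all choices made here, and every function name is hypothetical.

```python
import bisect
import random


def passive_sample(f, n, rng):
    """Draw n points from the uniform distribution P on [0, 1] and label them with f."""
    xs = sorted(rng.random() for _ in range(n))
    return [(x, f(x)) for x in xs]


def fit_piecewise_linear(data):
    """Assumed algorithm A: linear interpolation between sorted data points,
    held constant outside the sampled range."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]

    def h(x):
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        i = bisect.bisect_right(xs, x)  # xs[i - 1] <= x < xs[i]
        t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
        return ys[i - 1] + t * (ys[i] - ys[i - 1])

    return h


def lp_distance(f1, f2, lo, hi, p=2, steps=1000):
    """Midpoint-rule estimate of d_C(f1, f2) = (integral over C=[lo, hi] of |f1 - f2|^p)^(1/p)."""
    w = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * w
        total += abs(f1(x) - f2(x)) ** p
    return (total * w) ** (1.0 / p)


target = lambda x: x * x                      # hypothetical target function
data = passive_sample(target, 50, random.Random(0))
h = fit_piecewise_linear(data)
error = lp_distance(h, target, 0.0, 1.0)      # estimate of d_D(h, f)
```

Repeating the draw many times and counting how often `error` falls below a chosen $\epsilon$ would give an empirical estimate of the confidence $1 - \delta$ for this particular passive learner.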

2.2  The  Problem  of  Collecting  Examples

We have introduced in the earlier section our baseline
algorithm for collecting examples. This corresponds to
a passive learner that draws examples according to the
probability distribution $P$ on the domain $D$. If such a
passive learner collects examples and produces an
output $h$ such that $d_D(h, f)$ is less than $\epsilon$ with probability
greater than $1 - \delta$, it $P$-PAC learns the function. The
number of examples that a learner needs before it
produces such an ($\epsilon$-good, $\delta$-confidence) hypothesis is called
its sample complexity.

Against this baseline passive data collection scheme
lies the possibility of allowing the learner to choose its
own examples. At the outset it might seem reasonable
to believe that a data set would provide the learner with
some information about the target function; in
particular, it would probably inform it about the "interesting"
regions of the function, or regions where the
approximation error is high and which need further sampling. On the
basis of this kind of information (along with other
information about the class of functions in general) one might
be able to decide where to sample next. We formalize
this notion as follows:

Let $\mathcal{D} = \{(x_i, y_i) \mid i = 1 \ldots n\}$ be a data set
(containing $n$ data points) which the learner has access to.
The approximation scheme acts upon this data set and
picks an $h \in \mathcal{H}$ (which best fits the data according to
the specifics of the algorithm $A$ inherent in the
approximation scheme). Further, let $C_i$, $i = 1, \ldots, K(n)$,† be a
partition of the domain $D$ into different regions on the
basis of this data set. Finally, let

$$\mathcal{F}_{\mathcal{D}} = \{f \in \mathcal{F} \mid f(x_i) = y_i\ \forall (x_i, y_i) \in \mathcal{D}\}$$

This is the set of all functions in $\mathcal{F}$ which are consistent
with the data seen so far. The target function could be
any one of the functions in $\mathcal{F}_{\mathcal{D}}$.

We first define an error criterion $e_C$ (where $C$ is any
subset of the domain) as follows:

$$e_C(\mathcal{H}, \mathcal{D}, \mathcal{F}) = \sup_{f \in \mathcal{F}_{\mathcal{D}}} d_C(h, f)$$
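
The consistent set and the error criterion just defined can be illustrated on a toy problem. The sketch below assumes a finite candidate class (so the supremum becomes a maximum), a sup-norm on an evenly spaced grid as the metric $d_C$, and an algorithm $A$ that simply returns the first consistent candidate; none of these choices come from the paper, and all names are hypothetical.

```python
def consistent_set(candidates, data, tol=1e-12):
    """F_D: the candidates that agree with every observed pair (x_i, y_i)."""
    return [f for f in candidates
            if all(abs(f(x) - y) <= tol for x, y in data)]


def sup_distance(f1, f2, region, steps=100):
    """Assumed metric d_C: largest |f1 - f2| on an even grid over C = [lo, hi]."""
    lo, hi = region
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(abs(f1(x) - f2(x)) for x in grid)


def error_criterion(h, candidates, data, region):
    """e_C(H, D, F): the largest d_C(h, f) over all f still consistent with D."""
    return max(sup_distance(h, f, region)
               for f in consistent_set(candidates, data))


# Hypothetical finite function class on [0, 1].
candidates = [
    lambda x: x,
    lambda x: x * x,
    lambda x: 0.5 * x * (1 + x),
    lambda x: 1 - x,              # ruled out by the data below
]
data = [(0.0, 0.0), (1.0, 1.0)]

h = consistent_set(candidates, data)[0]                    # assumed algorithm A
e_D = error_criterion(h, candidates, data, (0.0, 1.0))     # whole domain
e_C = error_criterion(h, candidates, data, (0.0, 0.25))    # a sub-region
```

Here the worst consistent alternative to $h(x) = x$ is $x^2$, and the uncertainty on the sub-region cannot exceed the uncertainty on the whole domain, in line with the monotonicity required of $d_C$.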

Essentially, $e_C$ is a measure of the maximum
possible error the approximation scheme could have (over the
region $C$) given the data it has seen so far. It clearly
depends on the data, the approximation scheme, and the
class of functions being learned. It does not depend upon
the target function (except indirectly, in the sense that
the data is generated by the target function after all, and
this dependence is already captured in the expression).
We thus have a scheme to measure uncertainty
(maximum possible error) over the different regions of the
input space $D$. One possible strategy to select a new point
might simply be to sample the function in the region $C_i$
where the error bound is the highest. Let us assume we
have a procedure $\mathcal{P}$ to do this. $\mathcal{P}$ could be to sample
the region $C$ at the centroid of $C$, or to sample $C$
according to some distribution on it, or any other method one
might fancy. This can be described as follows:

Active Algorithm A

1. [Initialize] Collect one example $(x_1, y_1)$ by
sampling the domain $D$ once according to procedure
$\mathcal{P}$.

2. [Obtain New Partitions] Divide the domain $D$
into regions $C_1, \ldots, C_{K(1)}$ on the basis of this data
point.

3. [Compute Uncertainties] Compute $e_{C_i}$ for each
$i$.

4. [General Update and Stopping Rule] In
general, at the $j$th stage, suppose that our partition
of the domain $D$ is into $C_i$, $i = 1 \ldots K(j)$. One
can compute $e_{C_i}$ for each $i$ and sample the region
with maximum uncertainty (say $C_l$) according to
procedure $\mathcal{P}$. This would provide a new data point
$(x_{j+1}, y_{j+1})$. The new data point would re-partition
the domain $D$ into new regions. At any stage, if
the maximum uncertainty over the entire domain
$e_D$ is less than $\epsilon$, stop.

†The number of regions $K(n)$ into which the domain $D$
is partitioned by $n$ data points depends upon the geometry
of $D$ and the partition scheme used. For the real line
partitioned into intervals as in our example, $K(n) = n + 1$. For
$k$-cubes, one might obtain Voronoi partitions and compute
$K(n)$ accordingly.
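
The steps of Algorithm A can be sketched on a deliberately simple class. The example assumes unit step functions $f_t(x) = 1$ if $x \ge t$ and $0$ otherwise on $[0, 1]$ (the kind of class mentioned later in connection with Eisenberg and Rivest), an $L_1$ metric, a hypothesis that places the threshold at the midpoint of the consistent interval, and the centroid rule as the sampling procedure. These are illustrative assumptions rather than the classes analysed in this paper, and the function names are hypothetical.

```python
def overlap(lo1, hi1, lo2, hi2):
    """Length of the intersection of [lo1, hi1] and [lo2, hi2]."""
    return max(0.0, min(hi1, hi2) - max(lo1, lo2))


def region_error(region, a, b):
    """e_C for the step-function class under the L1 metric.

    Consistent thresholds lie in (a, b]; the hypothesis puts its threshold
    at the midpoint m. The worst consistent target sits at one end of the
    interval, so h and f can disagree on C within [a, m] or within [m, b].
    """
    lo, hi = region
    m = (a + b) / 2
    return max(overlap(lo, hi, a, m), overlap(lo, hi, m, b))


def algorithm_a(oracle, eps, max_queries=60):
    """Sample the most uncertain region at its centroid until e_D < eps."""
    xs = []                 # points sampled so far
    a, b = 0.0, 1.0         # thresholds consistent with the data lie in (a, b]
    while len(xs) < max_queries:
        if region_error((0.0, 1.0), a, b) < eps:    # stopping rule on e_D
            break
        pts = [0.0] + sorted(xs) + [1.0]
        regions = list(zip(pts[:-1], pts[1:]))      # intervals between samples
        lo, hi = max(regions, key=lambda c: region_error(c, a, b))
        x = (lo + hi) / 2                           # procedure P: centroid
        xs.append(x)
        if oracle(x) == 1:
            b = min(b, x)
        else:
            a = max(a, x)
    return (a + b) / 2, len(xs)


step_at_03 = lambda x: 1 if x >= 0.3 else 0         # hypothetical target
threshold, queries = algorithm_a(step_at_03, eps=0.01)
```

In this toy setting only the interval between the largest point labelled 0 and the smallest point labelled 1 carries any uncertainty, so the procedure reduces to binary search.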

The above algorithm is one possible active strategy.
However, one can carry the argument a little further and
obtain an optimal sampling strategy which would give
us a precise location for the next sample point. Imagine
for a moment that the learner asks for the value of the
function at a point $x \in D$. The value returned obviously
belongs to the set

$$\mathcal{F}_{\mathcal{D}}(x) = \{f(x) \mid f \in \mathcal{F}_{\mathcal{D}}\}$$

Assume that the value observed was $y \in \mathcal{F}_{\mathcal{D}}(x)$. In effect,
the learner now has one more example, the pair $(x, y)$,
which it can add to its data set to obtain a new, larger
data set $\mathcal{D}'$, where

$$\mathcal{D}' = \mathcal{D} \cup (x, y)$$

Once again, the approximation scheme $\mathcal{H}$ would map
the new data set $\mathcal{D}'$ into a new hypothesis $h'$. One can
compute

$$e_C(\mathcal{H}, \mathcal{D}', \mathcal{F}) = \sup_{f \in \mathcal{F}_{\mathcal{D}'}} d_C(h', f)$$

Clearly, $e_D(\mathcal{H}, \mathcal{D}', \mathcal{F})$ now measures the maximum
possible error after seeing this new data point. This
depends upon $(x, y)$ (in addition to the usual $\mathcal{H}$, $\mathcal{D}$ and
$\mathcal{F}$). For a fixed $x$, we don't know the value of $y$ we would
observe if we had chosen to sample at that point.
Consequently, a natural thing to do at this stage is to again
take a worst-case bound, i.e., assume we would get the
most unfavorable $y$ and proceed. This would provide the
maximum possible error we could make if we had chosen
to sample at $x$. This error (over the entire domain) is

$$\sup_{y \in \mathcal{F}_{\mathcal{D}}(x)} e_D(\mathcal{H}, \mathcal{D}', \mathcal{F}) = \sup_{y \in \mathcal{F}_{\mathcal{D}}(x)} e_D(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F})$$

Naturally, we would like to sample the point $x$ for which
this maximum error is minimized. Thus, the optimal
point to sample by this argument is

$$x_{new} = \arg\min_{x \in D} \sup_{y \in \mathcal{F}_{\mathcal{D}}(x)} e_D(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F}) \qquad (1)$$

This provides us with a principled strategy to choose
our next point. The following optimal active learning
algorithm follows:

Active Algorithm B (Optimal)

1. [Initialize] Collect one example $(x_1, y_1)$ by
sampling the domain $D$ once according to procedure
$\mathcal{P}$. We do this because without any data, the
approximation scheme would not be able to produce
any hypothesis.

2. [Compute Next Point to Sample] Apply eq. 1
and obtain $x_2$. Sampling the function at this point
yields the next data point $(x_2, y_2)$, which is added
to the data set.

3. [General Update and Stopping Rule] In
general, at the $j$th stage, assume we have in place a
data set $\mathcal{D}_j$ (consisting of $j$ data). One can
compute $x_{j+1}$ according to eq. 1 and, sampling the
function here, one can obtain a new hypothesis and a
new data set $\mathcal{D}_{j+1}$. In general, as in Algorithm A,
stop whenever the total error $e_D(\mathcal{H}, \mathcal{D}_k, \mathcal{F})$ is less
than $\epsilon$.
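
The min-max rule of eq. 1 and the loop of Algorithm B can be sketched for an assumed toy class. The sketch uses unit step functions $f_t(x) = 1$ if $x \ge t$ on $[0, 1]$, an $L_1$ metric, a midpoint-threshold hypothesis, and a finite grid of candidate query points standing in for the minimization over $D$. It also starts from the prior knowledge that the threshold lies in $(0, 1]$ instead of drawing an initial example. All of these are assumptions made for illustration, and the names are hypothetical.

```python
def worst_case_after(x, a, b):
    """sup over y in F_D(x) of e_D(H, D + (x, y), F) for the step-function class.

    Thresholds consistent with the data lie in (a, b]; with a midpoint
    hypothesis and the L1 metric, e_D is half the length of that interval.
    """
    outcomes = []
    if x > a:    # y = 1 is possible: some consistent threshold t <= x
        outcomes.append((min(b, x) - a) / 2)
    if x < b:    # y = 0 is possible: some consistent threshold t > x
        outcomes.append((b - max(a, x)) / 2)
    return max(outcomes)


def next_query(a, b, candidates):
    """Eq. 1 restricted to a finite candidate set: argmin over x of the worst case."""
    return min(candidates, key=lambda x: worst_case_after(x, a, b))


def algorithm_b(oracle, eps, grid_size=1025, max_queries=60):
    candidates = [i / (grid_size - 1) for i in range(grid_size)]
    a, b = 0.0, 1.0
    n = 0
    while (b - a) / 2 >= eps and n < max_queries:   # stop once e_D < eps
        x = next_query(a, b, candidates)
        if oracle(x) == 1:
            b = min(b, x)
        else:
            a = max(a, x)
        n += 1
    return (a + b) / 2, n


step_at_03 = lambda x: 1 if x >= 0.3 else 0         # hypothetical target
threshold, queries = algorithm_b(step_at_03, eps=0.01)
first_point = next_query(0.0, 1.0, [i / 100 for i in range(101)])
```

For this class the minimizer of the worst case is always the midpoint of the consistent interval, so the optimal strategy coincides with the centroid-based strategy of Algorithm A; in richer classes the two need not agree.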

By the process of derivation, it should be clear that
if we chose to sample at some point other than that
obtained by eq. 1, an adversary could provide a $y$ value and
a function consistent with all the data provided
(including the new data point) that would force the learner to
make a larger error than if the learner chose to sample
at $x_{new}$. In this sense, algorithm B is optimal. It also
differs from algorithm A in that it does not require a
partition scheme, or a procedure $\mathcal{P}$ to choose a point in
some region. However, the computation of $x_{new}$ inherent
in algorithm B is typically more intensive than the
computations required by algorithm A. Finally, it is worthwhile
to observe that crucial to our formulation is the
derivation of the error bound $e_D(\mathcal{H}, \mathcal{D}, \mathcal{F})$. As we have noted
earlier, this is a measure of the maximum possible error
the approximation scheme $\mathcal{H}$ could be forced to make
in approximating functions of $\mathcal{F}$ using the data set $\mathcal{D}$.
Now, if one wanted an approximation scheme-independent
bound, this would be obtained by minimizing $e_D$
over all possible schemes, i.e.,

$$\inf_{\mathcal{H}} e_D(\mathcal{H}, \mathcal{D}, \mathcal{F})$$

Any approximation scheme can be forced to make at
least as much error as the above expression denotes.
Another bound of some interest is obtained by removing the
dependence of $e_D$ on the data. Thus, given an
approximation scheme $\mathcal{H}$, if data $\mathcal{D}$ is drawn randomly, one
could compute

$$P\{e_D(\mathcal{H}, \mathcal{D}, \mathcal{F}) > \epsilon\}$$

or, in an approximation scheme-independent setting, one
computes

$$P\{\inf_{\mathcal{H}} e_D(\mathcal{H}, \mathcal{D}, \mathcal{F}) > \epsilon\}$$

The above expressions would provide us PAC-like
bounds which we will make use of later in this paper.
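
A probability of the form $P\{e_D > \epsilon\}$ can be estimated by simulation when $e_D$ has a closed form. The sketch below does this for a passive learner on an assumed toy class of unit step functions on $[0, 1]$ with an $L_1$ metric and a midpoint-threshold hypothesis, for which $e_D$ is half the gap between the largest point labelled 0 and the smallest point labelled 1. The class, the target threshold 0.3, the trial count, and all names are illustrative assumptions and not results from this paper.

```python
import random


def worst_case_error(xs, target_t):
    """e_D for the step-function class: half the gap that brackets the threshold."""
    a = max((x for x in xs if x < target_t), default=0.0)    # largest point labelled 0
    b = min((x for x in xs if x >= target_t), default=1.0)   # smallest point labelled 1
    return (b - a) / 2


def prob_error_exceeds(eps, n, target_t=0.3, trials=2000, seed=0):
    """Monte Carlo estimate of P{e_D > eps} when n points are drawn uniformly."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        if worst_case_error(xs, target_t) > eps:
            bad += 1
    return bad / trials


p_few = prob_error_exceeds(eps=0.05, n=2)
p_many = prob_error_exceeds(eps=0.05, n=200)
```

The estimate shrinks as the number of randomly drawn examples grows; the smallest $n$ for which it drops below a chosen $\delta$ is an empirical counterpart of the passive sample complexity.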

2.3  In  Context 

Having motivated and derived two possible active
strategies, it is worthwhile at this stage to comment on the
formulation and its place in the context of previous work
in a similar vein executed across a number of disciplines.

1) Optimal Recovery: The question of choosing the
location of points where the unknown function will be
sampled has been studied within the framework of
optimal recovery (Micchelli and Rivlin, 1976; Micchelli and
Wahba, 1981; Athavale and Wahba, 1979). While work
of this nature has strong connections to our formulation,
there remains a crucial difference. Sampling schemes
motivated by optimal recovery are not adaptive. In other
words, given a class of functions $\mathcal{F}$ (from which the
target $f$ is selected), optimal sampling chooses the points
$x_i \in D$, $i = 1, \ldots, n$ by optimizing over the entire
function space $\mathcal{F}$. Once these points are obtained, they
remain fixed irrespective of the target (and
correspondingly the data set $\mathcal{D}$). Thus, if we wanted to sample the
function at $n$ points, and had an approximation scheme
$\mathcal{H}$ with which we wished to recover the true target, a
typical optimal recovery formulation would involve
sampling the function at the points obtained as a result of
optimizing the following objective function:

$$\arg\min_{x_1, \ldots, x_n} \sup_{f \in \mathcal{F}} d(f, h(\mathcal{D} = \{(x_i, f(x_i))\}_{i=1 \ldots n})) \qquad (2)$$

where $h(\mathcal{D} = \{(x_i, f(x_i))\}_{i=1 \ldots n}) \in \mathcal{H}$ is the learner's
hypothesis when the target is $f$ and the function is
sampled at the $x_i$'s. Given no knowledge of the target, these
points are the optimal ones to sample.

In contrast, our scheme of sampling can be conceived
as an iterative application of optimal recovery (one point
at a time) by conditioning on the data seen so far.
Making this absolutely explicit, we start out by asking for
one point using optimal recovery. We obtain this point
by

$$\arg\min_{x_1} \sup_{f \in \mathcal{F}} d(f, h(\mathcal{D}_1 = \{(x_1, f(x_1))\}))$$

Having sampled at this point (and obtained $y_1$ from the
true target), we can now reduce the class of candidate
target functions to $\mathcal{F}_1$, the elements of $\mathcal{F}$ which are
consistent with the data seen so far. Now we obtain our
second point by

$$\arg\min_{x_2} \sup_{f \in \mathcal{F}_1} d(f, h(\mathcal{D}_2 = \{(x_1, y_1), (x_2, f(x_2))\}))$$

Note that the supremum is taken over a restricted set $\mathcal{F}_1$
the second time. In this fashion, we perform optimal
recovery at each stage, reducing the class of functions
over which the supremum is performed. It should be
made clear that this sequential optimal recovery is not a
greedy technique to arrive at the solution of eq. 2. It will
give us a different set of points. Further, this set of points
will depend upon the target function. In other words, the
sampling strategy adapts itself to the unknown target $f$
as it gains more information about that target through
the data. We know of no similar sequential sampling
scheme in the literature.

While classical optimal recovery has the formulation
of eq. 2, imagine a situation where a "teacher" who
knows the target function and the learner wishes to
communicate to the learner the best set of points to minimize
the error made by the learner. Thus, given a function $g$,
this best set of points can be obtained by the following
optimization:

$$\arg\min_{x_1, \ldots, x_n} d(g, h(\{(x_i, g(x_i))\}_{i=1 \ldots n})) \qquad (3)$$

Eq. 2 and eq. 3 provide two bounds on the
performance of the active learner following the strategy of
Algorithm B in the previous section. While eq. 2 chooses
optimal points without knowing anything about the
target, and eq. 3 chooses optimal points knowing the target
completely, the active learner chooses points optimally
on the basis of partial information about the target
(information provided by the data set).
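
The gap between a fixed design in the spirit of eq. 2 and the sequential scheme can be made concrete on an assumed toy class. The sketch uses unit step functions $f_t(x) = 1$ if $x \ge t$ on $[0, 1]$, an $L_1$ metric, and a midpoint-threshold hypothesis. For this class the worst-case error of a fixed design is half its largest gap, which equally spaced points minimize, whereas each sequential query halves the consistent interval whatever the answer. The class and the function names are illustrative assumptions, not the classes studied in this paper.

```python
def fixed_design_worst_case(points):
    """Worst-case L1 error when all sample points are chosen in advance:
    the threshold may fall in the largest gap, and the midpoint hypothesis
    can be off by half of that gap."""
    pts = [0.0] + sorted(points) + [1.0]
    return max(hi - lo for lo, hi in zip(pts[:-1], pts[1:])) / 2


def sequential_worst_case(n):
    """Worst-case L1 error after n sequential midpoint queries.

    Either answer halves the consistent interval, so the adversary's
    choice does not matter; we follow the 'label 1' branch.
    """
    a, b = 0.0, 1.0
    for _ in range(n):
        b = (a + b) / 2
    return (b - a) / 2


n = 7
equally_spaced = [i / (n + 1) for i in range(1, n + 1)]
fixed_error = fixed_design_worst_case(equally_spaced)   # 1 / (2 (n + 1))
adaptive_error = sequential_worst_case(n)               # 2 ** -(n + 1)
```

With seven samples the fixed design guarantees an error of 1/16, while the sequential scheme guarantees 1/256, which illustrates why conditioning on the data seen so far can pay off.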

2) Concept Learning: The PAC learning community
(which has traditionally focused on concept learning)
typically incorporates activity on the part of the learner
by means of queries the learner can make of an
oracle. Queries (Angluin, 1988) range from membership
queries (is $x$ an element of the target concept $c$?) to
statistical queries (Kearns, 1993, where the learner cannot
ask for data but can ask for estimates of functionals of
the function class) to arbitrary boolean-valued queries
(see Kulkarni et al. for an investigation of query
complexity). Our form of activity can be considered a natural
adaptation of membership queries to the case of learning
real-valued functions in our modified PAC model. It is
worthwhile to mention relevant work which touches the
contents of this paper at some points. The most
significant of these is an investigation of the sample
complexity of active versus passive learning conducted by
Eisenberg and Rivest (1990) for a simple class of unit step
functions. It was found that a binary search algorithm
could vastly outperform a passive learner in terms of the
number of examples it needed to $(\epsilon, \delta)$ learn the target
function. This paper is very much in the spirit of that
work, focusing as it does on the sample complexity
question. Another interesting direction is the transformation
of PAC-learning algorithms from a batch to an online mode.
While Littlestone et al. (1991) consider online learning
of linear functions, Kimber and Long (1992) consider
functions with bounded derivatives, which we examine
later in this paper. However, the question of choosing
one's data is not addressed at all. Kearns and Schapire
(1990) consider the learning of p-concepts (which are
essentially equivalent to learning classes of real-valued
functions with noise) and address the learning of
monotone functions in this context. Again, there is no active
component on the part of the learner.

3) Adaptive Integration: The novelty of our
formulation lies in its adaptive nature. There are some
similarities to work in adaptive numerical integration which
are worth mentioning. Roughly speaking, an adaptive
integration technique (Berntsen et al. 1991, book???)
divides the domain of integration into regions over which
the integration is done. Estimates are then obtained
of the error on each of these regions. The region with
maximum error is subdivided. Though the spirit of such
an adaptive approach is close to ours, specific results in
the field naturally differ because of differences between
the integration problem (and its error bounds) and the
approximation problem.

4) Bayesian and other formulations: It should be
noted that we have a worst-case formulation (the
supremum in our formulation represents the maximum
possible error the scheme might have). Alternate Bayesian
schemes have been devised (Mackay, 1991; Cohn, 1994)
from the perspective of optimal experiment design
(Fedorov). Apart from the inherently different
philosophical positions of the two schemes, an in-depth treatment
of the sample complexity question is not done there. We
will soon give two examples where we address this
sample complexity question closely. In a separate piece of
work (Sung and Niyogi, 1994), the author has also
investigated such Bayesian formulations from such an
information-theoretic perspective. Yet another
average-case formulation comes from the information-complexity
viewpoint of Traub and Wozniakowski (see Traub et al.
(1988) for details). Various interesting sampling
strategies are suggested by research in that spirit. We do not
attempt to compare them due to the difficulty in
comparing worst-case and average-case bounds.

Thus,  we  have  motivated  and  derived  in  this  section, 
two possible active strategies. The formulation is general.
We now demonstrate the usefulness of such a formulation
by considering two classes of real-valued functions
as examples and deriving specific active algorithms
from  this  perspective.  At  this  stage,  the  important  ques¬ 
tion  of  sample  complexity  of  active  versus  passive  learn¬ 
ing  still  remains  unresolved.  We  investigate  this  more 
closely  by  deriving  theoretical  bounds  and  performing 
empirical  simulation  studies  in  the  case  of  the  specific 
classes  we  consider. 

3  Example  1:  A  Class  of  Monotonically 
Increasing  Bounded  Functions 

Consider the following class of functions from the interval
$[0, 1] \subset \mathbb{R}$ to $\mathbb{R}$:

$$\mathcal{F} = \{ f : 0 \le f \le M, \text{ and } f(x) \ge f(y)\ \forall x \ge y \}$$

Note that the functions belonging to this class need not
be continuous though they do need to be measurable.
This class is PAC-learnable (with an $L_1(P)$ norm, in
which case our notion of PAC reduces to the classical
notion) though it has infinite pseudo-dimension¹ (in the
sense of Pollard (1984)). Thus, we observe:

Observation 1 The class $\mathcal{F}$ has infinite pseudo-dimension
(in the sense of Pollard (1984), Haussler (1989)).

Proof: To have infinite pseudo-dimension, it must be
the case that for every $n > 0$, there exists a set of
points $\{x_1, \ldots, x_n\}$ which is shattered by the class $\mathcal{F}$.
In other words, there must exist a fixed translation vector
$t = (t_1, \ldots, t_n)$ such that for every boolean vector
$b = (b_1, \ldots, b_n)$, there exists a function $f \in \mathcal{F}$ which
satisfies $f(x_i) - t_i > 0 \Leftrightarrow b_i = 1$. To see that this
is indeed the case, let the $n$ points be $x_i = i/(n+1)$
for $i$ going from 1 to $n$. Let the translation vector then
be given by $t_i = x_i$. For an arbitrary boolean vector
$b$ we can always come up with a monotonic function
such that $f(x_i) = i/(n+1) - 1/(3(n+1))$ if $b_i = 0$ and
$f(x_i) = i/(n+1) + 1/(3(n+1))$ if $b_i = 1$. □
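The construction in this proof can be checked mechanically for small $n$. The following is a minimal sketch, assuming $M \ge 1$ so that the constructed values stay inside $[0, M]$; the function names are hypothetical and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction
from itertools import product

def shattering_values(bits):
    """Values f(x_i) from the proof for one boolean vector (hypothetical helper)."""
    n = len(bits)
    offset = Fraction(1, 3 * (n + 1))          # 1/(3(n+1))
    values = []
    for i, b in enumerate(bits, start=1):
        centre = Fraction(i, n + 1)            # t_i = x_i = i/(n+1)
        values.append(centre + offset if b else centre - offset)
    return values

def check_shattering(n):
    """True if every boolean vector of length n is realised by increasing values."""
    thresholds = [Fraction(i, n + 1) for i in range(1, n + 1)]
    for bits in product([0, 1], repeat=n):
        vals = shattering_values(bits)
        increasing = all(a < b for a, b in zip(vals, vals[1:]))
        pattern_ok = all((v - t > 0) == bool(b)
                         for v, t, b in zip(vals, thresholds, bits))
        in_range = all(0 <= v <= 1 for v in vals)   # assumes M >= 1
        if not (increasing and pattern_ok and in_range):
            return False
    return True

if __name__ == "__main__":
    print(all(check_shattering(n) for n in range(1, 7)))
```

Any increasing values at the points $x_i$ can be extended to a monotone function on $[0, 1]$ (for instance by linear interpolation), which is why checking the values at the $x_i$ is enough here.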

We also need to specify the terms $\mathcal{H}$, $d_C$, the procedure
$\mathcal{P}$ for partitioning the domain $D = [0, 1]$ and so
on. For our purposes, we assume that the approximation
scheme $\mathcal{H}$ is first order splines. This is simply finding
the monotonic function which interpolates the data
in a piece-wise linear fashion. A natural way to partition
the domain is to divide it into the intervals $[0, x_1)$,
$[x_1, x_2), \ldots, [x_i, x_{i+1}), \ldots, [x_n, 1]$. The metric $d_C$ is an $L_p$
metric given by $d_C(f_1, f_2) = \left(\int_0^1 |f_1 - f_2|^p\,dx\right)^{1/p}$.

¹ Finite pseudo-dimension is only a sufficient and not
necessary condition for PAC learnability as this example
demonstrates.

Note that we are specifically interested in comparing
the sample complexities of passive and active learning.
We will do this under a uniform distributional assumption,
i.e., the passive learner draws its examples by sampling
the target function uniformly at random on its domain
[0, 1]. In contrast, we will show how our general
formulation in the earlier section translates into a specific
active algorithm for choosing points, and we derive
bounds on its sample complexity. We begin by first providing
a lower bound for the number of examples a passive
PAC learner would need to draw to learn this class.

3.1  Lower  Bound  for  Passive  Learning 

Theorem 1 Any passive learning algorithm (more
specifically, any approximation scheme which draws data
uniformly at random and interpolates the data by any
arbitrary bounded function) will have to draw at least
$\frac{1}{2}(M/2\epsilon)^p \ln(1/\delta)$ examples to $P$-PAC learn the class,
where $P$ is a uniform distribution.

Proof: Consider the uniform distribution on [0, 1] and a
subclass of functions which have value 0 on the region
$A = [0, 1 - (2\epsilon/M)^p]$ and belong to $\mathcal{F}$. Suppose the passive
learner draws $l$ examples uniformly at random. Then
with probability $(1 - (2\epsilon/M)^p)^l$ all these examples will
be drawn from region $A$. It only remains to show that
for the subclass considered, whatever be the function
hypothesized by the learner, an adversary can force it to
make a large error.

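Suppose the learner hypothesizes that the function is
$h$. Let the value of

$$\left(\int_{(1-(2\epsilon/M)^p,\,1)} |h(x)|^p\,dx\right)^{1/p}$$

be $\chi$. Obviously $0 \le \chi \le \left(M^p (2\epsilon/M)^p\right)^{1/p} = 2\epsilon$. If $\chi < \epsilon$ then the adversary can
claim that the target function was really

$$g(x) = \begin{cases} 0 & \text{for } x \in [0,\ 1-(2\epsilon/M)^p] \\ M & \text{for } x \in (1-(2\epsilon/M)^p,\ 1] \end{cases}$$

If, on the other hand, $\chi \ge \epsilon$, then the adversary can
claim the function was really $g = 0$.

In the first case, by the triangle inequality,

$$d(h, g) = \left(\int_{[0,1]} |g - h|^p\,dx\right)^{1/p} \ge \left(\int_{(1-(2\epsilon/M)^p,\,1)} |g - h|^p\,dx\right)^{1/p}$$

$$\ge \left(\int_{(1-(2\epsilon/M)^p,\,1)} M^p\,dx\right)^{1/p} - \left(\int_{(1-(2\epsilon/M)^p,\,1)} |h|^p\,dx\right)^{1/p} = 2\epsilon - \chi > \epsilon$$

In the second case,

$$d(h, g) = \left(\int_{[0,1]} |g - h|^p\,dx\right)^{1/p} \ge \left(\int_{(1-(2\epsilon/M)^p,\,1)} |0 - h|^p\,dx\right)^{1/p} = \chi \ge \epsilon$$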

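Now we need to find out how large $l$ must be so that
this particular event of drawing all examples in $A$ is not
very likely, in particular, it has probability less than $\delta$.

For $(1 - (2\epsilon/M)^p)^l$ to be greater than $\delta$, we need

$$l < \frac{\ln(1/\delta)}{-\ln(1 - (2\epsilon/M)^p)}$$

It is a fact that for $\alpha < 1/2$, $\frac{1}{2\alpha} < \frac{1}{-\ln(1-\alpha)}$. Making use of this fact (and setting
$\alpha = (2\epsilon/M)^p$), we see that for $\epsilon < \frac{M}{2}\left(\frac{1}{2}\right)^{1/p}$, we have

$$\frac{1}{2}(M/2\epsilon)^p \ln(1/\delta) < \frac{\ln(1/\delta)}{-\ln(1 - (2\epsilon/M)^p)}$$

So unless $l$ is
greater than $\frac{1}{2}(M/2\epsilon)^p \ln(1/\delta)$, the probability that all
examples are chosen from $A$ is greater than $\delta$. Consequently,
with probability greater than $\delta$, the passive
learner is forced to make an error of at least $\epsilon$, and PAC
learning cannot take place. □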
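To make the bound of Theorem 1 concrete, the following minimal sketch evaluates it and the probability that all $l$ uniform draws land in the region $A$. The function names and the parameter values in the usage lines are illustrative assumptions, not taken from the paper.

```python
import math

def passive_lower_bound(M, eps, p, delta):
    """(1/2) (M / (2 eps))^p ln(1/delta), the sample size in Theorem 1."""
    return 0.5 * (M / (2 * eps)) ** p * math.log(1 / delta)

def prob_all_in_A(M, eps, p, l):
    """Probability that l uniform draws all fall in A = [0, 1 - (2 eps / M)^p]."""
    alpha = (2 * eps / M) ** p
    return (1 - alpha) ** l

if __name__ == "__main__":
    M, eps, p, delta = 1.0, 0.05, 2, 0.1      # illustrative values only
    l = int(passive_lower_bound(M, eps, p, delta))
    # With this many draws the "all in A" event is still more likely than delta.
    print(l, prob_all_in_A(M, eps, p, l) > delta)
```

The check relies on the condition $\epsilon < \frac{M}{2}(1/2)^{1/p}$ from the proof, which the illustrative values satisfy.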

3.2  Active  Learning  Algorithms 

In  the  previous  section  we  computed  a  lower  bound 
for  passively  PAC  learning  this  class  for  a  uniform 
distribution². Here we derive an active learning strategy
(the CLA algorithm) which would meaningfully choose
new examples on the basis of information gathered about
the target from previous examples. This is a specific instantiation
of the general formulation, and interestingly
yields a “divide and conquer” binary searching algorithm
starting  from  a  different  philosophical  standpoint.  We 
formally  prove  an  upper  bound  on  the  number  of  exam¬ 
ples  it  requires  to  PAC  learn  the  class.  While  this  upper 
bound  is  a  worst  case  bound  and  holds  for  all  functions 
in  the  class,  the  actual  number  of  queries  (examples)  this 
strategy  takes  differs  widely  depending  upon  the  target 
function.  We  demonstrate  empirically  the  performance 
of  this  strategy  for  different  kinds  of  functions  in  the 
class  in  order  to  get  a  feel  for  this  difference.  We  de¬ 
rive  a  classical  non-sequential  optimal  sampling  strategy 
and  show  that  this  is  equivalent  to  uniformly  sampling 
the  target  function.  Finally,  we  are  able  to  empirically 
demonstrate  that  the  active  algorithm  outperforms  both 
the  passive  and  uniform  methods  of  data  collection. 

3.2.1  Derivation  of  an  optimal  sampling 
strategy 

Consider an approximation scheme of the sort described
earlier attempting to approximate a target function
$f \in \mathcal{F}$ on the basis of a data set $\mathcal{D}$. Shown in
fig. 2 is a picture of the situation. We can assume
without loss of generality that we start out by knowing
the value of the function at the points $x = 0$ and
$x = 1$. The points $\{x_i;\ i = 1, \ldots, n\}$ divide the domain
into $n + 1$ intervals $C_i$ ($i$ going from 0 to $n$) where
$C_i = [x_i, x_{i+1}]$ ($x_0 = 0$, $x_{n+1} = 1$). The monotonicity
constraint on $\mathcal{F}$ permits us to obtain rectangular boxes
showing the values that the target function could take at
the points on its domain. The set of all functions which
lie within these boxes as shown is $\mathcal{F}_{\mathcal{D}}$.

Let us first compute $e_{C_i}(\mathcal{H}, \mathcal{D}, \mathcal{F})$ for some interval
$C_i$. On this interval, the function is constrained to lie
in the appropriate box. We can zoom in on this box as
shown in fig. 3.

² Naturally, this is a distribution-free lower bound as well.
In other words, we have demonstrated the existence of a distribution
for which the passive learner would have to draw at
least as many examples as the theorem suggests.

Figure 2: A depiction of the situation for an arbitrary
data set. The set $\mathcal{F}_{\mathcal{D}}$ consists of all functions lying in the
boxes and passing through the datapoints (for example,
the dotted lines). The approximating function $h$ is a
linear interpolant shown by a solid line.

Figure 3: Zoomed version of interval. The maximum
error the approximation scheme could have is indicated
by the shaded region. This happens when the adversary
claims the target function had the value $y_i$ throughout
the interval.

The maximum error the approximation scheme could
have (indicated by the shaded region) is clearly given by

$$\left(\int_{C_i} |h - f(x_i)|^p\,dx\right)^{1/p} = \left(\int_0^B \left(\frac{A}{B}x\right)^p dx\right)^{1/p} = \frac{A\,B^{1/p}}{(p+1)^{1/p}}$$

where $A = f(x_{i+1}) - f(x_i)$ and $B = (x_{i+1} - x_i)$.

Clearly the error over the entire domain $e_D$ is given
by

$$e_D^p = \sum_{i=0}^{n} e_{C_i}^p$$
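The closed form for the per-interval uncertainty, $A B^{1/p} / (p+1)^{1/p}$, can be compared against direct numerical integration of $((A/B)x)^p$ over $[0, B]$. This is only a sanity-check sketch; the function names and the step count are assumptions made for illustration.

```python
def interval_uncertainty(A, B, p):
    """Closed form e_C = A * B^(1/p) / (p + 1)^(1/p)."""
    return A * B ** (1.0 / p) / (p + 1) ** (1.0 / p)

def interval_uncertainty_numeric(A, B, p, steps=100_000):
    """Midpoint-rule value of (integral_0^B (A x / B)^p dx)^(1/p)."""
    dx = B / steps
    total = sum((A * (k + 0.5) * dx / B) ** p for k in range(steps)) * dx
    return total ** (1.0 / p)

if __name__ == "__main__":
    for p in (1, 2, 3):
        print(p, interval_uncertainty(0.7, 0.3, p),
              interval_uncertainty_numeric(0.7, 0.3, p))
```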

The computation of $e_C$ is all we need to implement an
active strategy motivated by Algorithm A in section 2.
All we need to do is sample the function in the interval
with largest error; recall that we need a procedure $\mathcal{P}$
to determine how to sample this interval to obtain a
new data point. We choose (arbitrarily) to sample the
midpoint of the interval with the largest error yielding
the following algorithm.

The Choose and Learn Algorithm (CLA)

1. [Initial Step] Ask for values of the function at
points $x = 0$ and $x = 1$. At this stage, the domain
[0, 1] is composed of one interval only, viz., [0, 1].
Compute $E_1 = \frac{1}{(p+1)^{1/p}} (1 - 0)^{1/p}\, |f(1) - f(0)|$ and
$T_1 = E_1$. If $T_1 < \epsilon$, stop and output the linear interpolant
of the samples as the hypothesis, otherwise
query the midpoint of the interval to get a partition
of the domain into two subintervals [0, 1/2)
and [1/2, 1].

2. [General Update and Stopping Rule] In general,
at the $k$th stage, suppose that our partition
of the interval [0, 1] is $[x_0 = 0, x_1), [x_1, x_2), \ldots,
[x_{k-1}, x_k = 1]$. We compute the normalized error
$E_i = \frac{1}{(p+1)^{1/p}} (x_i - x_{i-1})^{1/p}\, |f(x_i) - f(x_{i-1})|$ for
all $i = 1, \ldots, k$. The midpoint of the interval with
maximum $E_i$ is queried for the next sample. The
total normalized error $T_k = \left(\sum_{i=1}^{k} E_i^p\right)^{1/p}$ is computed
at each stage and the process is terminated
when $T_k < \epsilon$. Our hypothesis $h$ at every stage is
a linear interpolation of all the points sampled so
far and our final hypothesis is obtained upon the
termination of the whole process.
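The two steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used for the simulations reported later: the function name, the `max_queries` safety cap, and the use of the per-interval formula $E_i = (p+1)^{-1/p} (x_i - x_{i-1})^{1/p} |f(x_i) - f(x_{i-1})|$ as the normalized error are choices made here for illustration.

```python
def choose_and_learn(f, eps, p=1, max_queries=10_000):
    """Sketch of CLA for a monotone f on [0, 1]; returns the sorted (x, y) samples."""
    xs = [0.0, 1.0]
    ys = [f(0.0), f(1.0)]
    g = (1.0 / (p + 1)) ** (1.0 / p)

    def interval_error(i):
        # Normalized error E_i of the interval [xs[i], xs[i + 1]].
        return g * (xs[i + 1] - xs[i]) ** (1.0 / p) * abs(ys[i + 1] - ys[i])

    while len(xs) < max_queries:          # safety cap, not part of the algorithm
        errors = [interval_error(i) for i in range(len(xs) - 1)]
        total = sum(e ** p for e in errors) ** (1.0 / p)   # T_k
        if total < eps:
            break
        worst = max(range(len(errors)), key=errors.__getitem__)
        mid = 0.5 * (xs[worst] + xs[worst + 1])
        xs.insert(worst + 1, mid)         # query the midpoint of the worst interval
        ys.insert(worst + 1, f(mid))
    return list(zip(xs, ys))

if __name__ == "__main__":
    samples = choose_and_learn(lambda x: x * x, eps=0.05, p=1)
    print(len(samples), [round(x, 4) for x, _ in samples])
```

The hypothesis is then the piece-wise linear interpolant through the returned samples. Recomputing every $E_i$ at each stage keeps the sketch short; a priority queue would avoid the repeated work.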

Now imagine that we chose to sample at a point $x \in
C_i = [x_i, x_{i+1}]$ and received the value $y \in \mathcal{F}_{\mathcal{D}}(x)$ (i.e.,
$y$ in the box) as shown in the fig. 4. This adds one
more interval and divides $C_i$ into two subintervals $C_{i1}$
and $C_{i2}$ where $C_{i1} = [x_i, x]$ and $C_{i2} = [x, x_{i+1}]$. We
also correspondingly obtain two smaller boxes inside the
larger box within which the function is now constrained
to lie. The uncertainty measure $e_C$ can be recomputed
taking this into account.

Observation 2 The addition of the new data point
$(x, y)$ does not change the uncertainty value on any of
the other intervals. It only affects the interval $C_i$ which
got subdivided. The total uncertainty over this interval
is now given by

$$e_{C_i}(\mathcal{H}, \mathcal{D}', \mathcal{F}) = \left(\frac{1}{p+1}\right)^{1/p} \left( (x - x_i)(y - f(x_i))^p + (x_{i+1} - x)(f(x_{i+1}) - y)^p \right)^{1/p}$$

$$= G\left(z r^p + (B - z)(A - r)^p\right)^{1/p}$$

where $\mathcal{D}' = \mathcal{D} \cup (x, y)$, $G = (1/(p+1))^{1/p}$, and for convenience
we have used the substitution $z = x - x_i$, $r = y - f(x_i)$, and $A$ and $B$ are $f(x_{i+1}) - f(x_i)$
and $x_{i+1} - x_i$ as above. Clearly $z$ ranges from 0 to $B$
while $r$ ranges from 0 to $A$.

Figure 4: The situation when the interval $C_i$ is sampled
yielding a new data point. This subdivides the interval
into two subintervals and the two shaded boxes indicate
the new constraints on the function.

We  first  prove  the  following  lemma: 

Lemma 1

$$B/2 = \arg\min_{z \in [0,B]}\ \sup_{r \in [0,A]} G\left(z r^p + (B - z)(A - r)^p\right)^{1/p}$$

Proof: Consider any $z \in [0, B]$. There are three cases
to consider:

Case I: $z > B/2$. Let $z = B/2 + \alpha$ where $\alpha > 0$. We find

$$\sup_{r \in [0,A]} G\left(z r^p + (B - z)(A - r)^p\right)^{1/p} = G\left(\sup_{r \in [0,A]} \left(z r^p + (B - z)(A - r)^p\right)\right)^{1/p}$$

Now,

$$\sup_{r \in [0,A]} \left(z r^p + (B - z)(A - r)^p\right) = \sup_{r \in [0,A]} \left((B/2 + \alpha) r^p + (B/2 - \alpha)(A - r)^p\right)$$

$$= \sup_{r \in [0,A]} \left(\frac{B}{2}\left(r^p + (A - r)^p\right) + \alpha\left(r^p - (A - r)^p\right)\right)$$

Now for $r = A$, the expression within the supremum,
$\frac{B}{2}(r^p + (A - r)^p) + \alpha(r^p - (A - r)^p)$, is equal to $(B/2 + \alpha)A^p$.
For any other $r \in [0, A]$, we need to show that

$$\frac{B}{2}\left(r^p + (A - r)^p\right) + \alpha\left(r^p - (A - r)^p\right) \le (B/2 + \alpha)A^p$$

or, dividing by $A^p$,

$$\frac{B}{2}\left((r/A)^p + (1 - r/A)^p\right) + \alpha\left((r/A)^p - (1 - r/A)^p\right) \le B/2 + \alpha$$

Putting $\beta = r/A$ (clearly $\beta \in [0, 1]$), and noticing that
$(1 - \beta)^p \le 1 - \beta^p$ and $\beta^p - (1 - \beta)^p \le 1$, the inequality
above is established. Consequently, we are able to see
that

$$\sup_{r \in [0,A]} G\left(z r^p + (B - z)(A - r)^p\right)^{1/p} = G\,(B/2 + \alpha)^{1/p} A$$

Case II: Let $z = B/2 - \alpha$ for $\alpha > 0$. In this case, by
a similar argument as above, it is possible to show that
again,

$$\sup_{r \in [0,A]} G\left(z r^p + (B - z)(A - r)^p\right)^{1/p} = G\,(B/2 + \alpha)^{1/p} A$$

Case III: Finally, let $z = B/2$. Here

$$\sup_{r \in [0,A]} G\left(z r^p + (B - z)(A - r)^p\right)^{1/p} = G\,(B/2)^{1/p} \sup_{r \in [0,A]} \left(r^p + (A - r)^p\right)^{1/p}$$

Clearly, then for this case, the above expression is reduced
to $G A (B/2)^{1/p}$. Considering the three cases, the
lemma is proved. □

The above lemma in conjunction with eq. 4 and observation
2 proves that if we choose to sample a particular
interval $C_i$ then sampling the midpoint is the optimal
thing to do. In particular, we see that

$$\min_{x \in [x_i, x_{i+1}]}\ \sup_{y \in [f(x_i), f(x_{i+1})]} e_{C_i}(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F}) = \left(\frac{1}{p+1}\right)^{1/p} \left(\frac{x_{i+1} - x_i}{2}\right)^{1/p} \left(f(x_{i+1}) - f(x_i)\right) = \frac{e_{C_i}(\mathcal{H}, \mathcal{D}, \mathcal{F})}{2^{1/p}}$$

In other words, if the learner were constrained to pick
its next sample in the interval $C_i$, then by sampling the
midpoint of this interval $C_i$, the learner ensures that the
maximum error it could be forced to make by a malicious
adversary is minimized. In particular, if the uncertainty
over the interval $C_i$ with its current data set $\mathcal{D}$ is $e_{C_i}$,
the uncertainty over this region will be reduced after
sampling its midpoint and can have a maximum value of
$e_{C_i}/2^{1/p}$.

Now which interval must the learner sample to minimize
the maximum possible uncertainty over the entire
domain $D = [0, 1]$? Noting that if the learner chose to
sample the interval $C_i$ then

$$\min_{x \in C_i = [x_i, x_{i+1}]}\ \sup_{y \in \mathcal{F}_{\mathcal{D}}(x)} e_{D=[0,1]}(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F}) = \left(\sum_{j=0,\ j \ne i}^{n} e_{C_j}^p(\mathcal{H}, \mathcal{D}, \mathcal{F}) + \frac{e_{C_i}^p(\mathcal{H}, \mathcal{D}, \mathcal{F})}{2}\right)^{1/p}$$

From the decomposition above, it is clear that the optimal
point to sample according to the principle embodied
in Algorithm B is the midpoint of the interval $C_j$ which
has the maximum uncertainty $e_{C_j}(\mathcal{H}, \mathcal{D}, \mathcal{F})$ on the basis
of the data seen so far, i.e., the data set $\mathcal{D}$. Thus we can
state the following theorem

Theorem  2  The  CLA  is  the  optimal  algorithm  for  the 
class  of  monotonic  functions 
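Lemma 1, on which this theorem rests, can be sanity-checked numerically by a brute-force grid search over the split point $z$ and the adversary's value $r$. The sketch below drops the constant factor $(1/(p+1))^{1/p}$, which does not affect the minimizer; the function names and grid sizes are assumptions chosen for illustration.

```python
def worst_case_uncertainty(z, A, B, p, grid=2001):
    """sup over r in [0, A] of (z r^p + (B - z)(A - r)^p)^(1/p), by grid search."""
    best = 0.0
    for k in range(grid):
        r = A * k / (grid - 1)
        best = max(best, z * r ** p + (B - z) * (A - r) ** p)
    return best ** (1.0 / p)

def best_split(A, B, p, grid=201):
    """Grid point z in [0, B] that minimizes the worst-case uncertainty."""
    candidates = [B * k / (grid - 1) for k in range(grid)]
    return min(candidates, key=lambda z: worst_case_uncertainty(z, A, B, p))

if __name__ == "__main__":
    for p in (1, 2, 3):
        print(p, best_split(A=1.0, B=1.0, p=p))   # the midpoint B/2 is expected
```

A grid search only illustrates the lemma for particular values of $A$, $B$ and $p$; it is not a substitute for the three-case argument.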


Having thus established that our binary searching algorithm
(CLA) is optimal, we now turn our efforts to determining
the number of examples the CLA would need
in order to learn the unknown target function to $\epsilon$ accuracy
with $\delta$ confidence. In particular, we can prove the
following theorem.

Theorem 3 The CLA converges in at most $(M/\epsilon)^p$
steps. Specifically, after collecting at most $(M/\epsilon)^p$ examples,
its hypothesis is $\epsilon$ close to the target with probability 1.

Proof Sketch: The proof of convergence for this algorithm
is a little tedious. However, to convince the reader,
we provide the proof of convergence for a slight variant
of the active algorithm. It is possible to show (not
shown here) that the convergence time for the active algorithm
described earlier is bounded by the convergence
time for the variant. First, consider a uniform grid of
points $(\epsilon/M)^p$ apart on the domain [0, 1]. Now imagine
that the active learner works just as described earlier
but with a slight twist, viz., it can only query points on
this grid. Thus at the $k$th stage, instead of querying the
true midpoint of the interval with largest uncertainty, it
will query the gridpoint closest to this midpoint. Obviously
the intervals at the $k$th stage are also separated
by points on the grid (i.e. previous queries). If it is the
case that the learner has queried all the points on the
grid, then the maximum possible error it could make is
less than $\epsilon$. To see this, let $\alpha = (\epsilon/M)^p$ and let us first look
at a specific small interval $[k\alpha, (k+1)\alpha]$. We know the
following to be true for this subinterval:

$$f(k\alpha) = h(k\alpha) \le f(x),\ h(x) \le f((k+1)\alpha) = h((k+1)\alpha)$$

Thus

$$|f(x) - h(x)| \le f((k+1)\alpha) - f(k\alpha)$$

and so over the interval $[k\alpha, (k+1)\alpha]$

$$\int_{k\alpha}^{(k+1)\alpha} |f(x) - h(x)|^p\,dx \le \int_{k\alpha}^{(k+1)\alpha} \left(f((k+1)\alpha) - f(k\alpha)\right)^p dx \le \left(f((k+1)\alpha) - f(k\alpha)\right)^p \alpha$$

It follows that

$$\int_{[0,1]} |f - h|^p\,dx \le \alpha\left((f(\alpha) - f(0))^p + (f(2\alpha) - f(\alpha))^p + \ldots + (f(1 - \alpha) - f(1 - 2\alpha))^p + (f(1) - f(1 - \alpha))^p\right)$$

$$\le \alpha\left(f(\alpha) - f(0) + f(2\alpha) - f(\alpha) + \ldots + f(1) - f(1 - \alpha)\right)^p \le \alpha\left(f(1) - f(0)\right)^p \le \alpha M^p$$

So if $\alpha = (\epsilon/M)^p$, we see that the $L_p$ error would be at
most $\left(\int_{[0,1]} |f - h|^p\,dx\right)^{1/p} \le \epsilon$. Thus the active learner
moves from stage to stage collecting examples at the grid
points. It could converge at any stage, but clearly after it
has seen the value of the unknown target at all the gridpoints,
its error is provably less than $\epsilon$ and consequently
it must stop by this time. □
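The key step of the proof sketch, that a full grid of spacing $(\epsilon/M)^p$ already forces the $L_p$ error of the linear interpolant below $\epsilon$, can be illustrated numerically. The sketch below is an assumption-laden check: the function name is hypothetical, the number of cells is rounded up so the spacing is at most $(\epsilon/M)^p$, and the integral is approximated by a midpoint rule.

```python
import math

def grid_interpolant_error(f, M, eps, p, sub=50):
    """Approximate L_p error of the linear interpolant of f on the proof's grid."""
    n = math.ceil((M / eps) ** p)           # number of cells, spacing <= (eps/M)^p
    total = 0.0
    for k in range(n):
        a, b = k / n, (k + 1) / n
        fa, fb = f(a), f(b)
        for j in range(sub):                # midpoint rule inside the cell
            x = a + (j + 0.5) * (b - a) / sub
            h = fa + (fb - fa) * (x - a) / (b - a)
            total += abs(f(x) - h) ** p * (b - a) / sub
    return total ** (1.0 / p)

if __name__ == "__main__":
    step = lambda x: 0.0 if x < 0.53 else 1.0   # a discontinuous monotone target
    print(grid_interpolant_error(step, M=1.0, eps=0.25, p=2) <= 0.25)
```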


Figure  5:  How  the  CLA  chooses  its  examples.  Vertical 
lines  have  been  drawn  to  mark  the  x-coordinates  of  the 
points  at  which  the  algorithm  asks  for  the  value  of  the 
function. 


3.3  Empirical  Simulations,  and  other 
Investigations 

Our  aim  here  is  to  characterise  the  performance  of  CLA 
as  an  active  learning  strategy.  Remember  that  CLA  is 
an  adaptive  example  choosing  strategy  and  the  number 
of  samples  it  would  take  to  converge  depends  upon  the 
specific  nature  of  the  target  function.  We  have  already 
computed  an  upper  bound  on  the  number  of  samples  it 
would  take  to  converge  in  the  worst  case.  In  this  section 
we  try  to  provide  some  intuition  as  to  how  this  sampling 
strategy  differs  from  random  draw  of  points  (equivalent 
to  passive  learning)  or  drawing  points  on  a  uniform  grid 
(equivalent  to  optimal  recovery  following  eq.  2  as  we  shall 
see  shortly).  We  perform  simulations  on  arbitrary  mono¬ 
tonic  increasing  functions  to  better  characterize  condi¬ 
tions  under  which  the  active  strategy  could  outperform 
both  a  passive  learner  as  well  as  a  uniform  learner. 

3.3.1  Distribution  of  Points  Selected 

As  has  been  mentioned  earlier,  the  points  selected  by 
CLA depend upon the specific target function. Shown in
fig.  3-5  is  the  performance  of  the  algorithm  for  an  ar¬ 
bitrarily  constructed  monotonically  increasing  function. 
Notice  the  manner  in  which  it  chooses  its  examples.  In¬ 
formally  speaking,  in  regions  where  the  function  changes 
a  lot  (such  regions  can  be  considered  to  have  high  in¬ 
formation  density  and  consequently  more  “interesting”), 
CLA  samples  densely.  In  regions  where  the  function 
doesn’t  change  much  (correspondingly  low  information 
density),  it  samples  sparsely.  As  a  matter  of  fact,  the 
density  of  the  points  seems  to  follow  the  derivative  of 
the  target  function  as  shown  in  fig.  6. 

Consequently,  we  conjecture  that 

Conjecture  1  The  density  of  points  sampled  by  the  ac¬ 
tive  learning  algorithm  is  proportional  to  the  derivative 
of  the  function  at  that  point  for  differentiable  functions. 

Remarks: 


Figure 6: The dotted line shows the density of the samples
along the x-axis when the target was the monotone
function of the previous example. The bold line is a plot
of  the  derivative  of  the  function.  Notice  the  correlation 
between  the  two. 


1. The CLA seems to sample functions according to
their rate of change over the different regions. We
have remarked earlier that the best possible sampling
strategy would be obtained by eq. 3 earlier.
This corresponds to a teacher (who knows the target
function and the learner) selecting points for
the learner. How does the CLA sampling strategy
differ  from  the  best  possible  one?  Does  the  sam¬ 
pling  strategy  converge  to  the  best  possible  one  as 
the  data  goes  to  infinity?  In  other  words,  does  the 
CLA  discover  the  best  strategy?  These  are  inter¬ 
esting  questions.  We  do  not  know  the  answer. 

2.  We  remarked  earlier  that  another  bound  on  the 
performance  of  the  active  strategy  was  that  pro¬ 
vided  by  the  classical  optimal  recovery  formulation 
of  eq.  2.  This,  as  we  shall  show  in  the  next  section, 
is  equivalent  to  uniform  sampling.  We  remind  the 
reader  that  a  crucial  difference  between  uniform 
sampling  and  CLA  lies  in  the  fact  that  CLA  is  an 
adaptive  strategy  and  for  some  functions  might  ac¬ 
tually  learn  with  very  few  examples.  We  will  ex¬ 
plore  this  difference  soon. 

3.3.2  Classical  Optimal  Recovery 

For an $L_1$ error criterion, classical optimal recovery
as given by eq. 2 yields a uniform sampling strategy. To
see this, imagine that we chose to sample the function at
points $x_i$, $i = 1, \ldots, n$. Pick a possible target function $f$
and let $y_i = f(x_i)$ for each $i$. We then get the situation
depicted in fig. 7. The $n$ points divide the domain into
$n + 1$ intervals. Let these intervals have length $a_i$ each as
shown. Further, if $[x_{i-1}, x_i]$ corresponds to the interval
of length $a_i$, then let $y_i - y_{i-1} = b_i$. In other words we
would get $n + 1$ rectangles with sides $a_i$ and $b_i$ as shown
in the figure.

Figure 7: The situation when a function $f \in \mathcal{F}$ is picked,
$n$ sample points (the $x$'s) are chosen and the corresponding
$y$ values are obtained. Each choice of sample points
corresponds to a choice of the $a$'s. Each choice of a function
corresponds to a choice of the $b$'s.

It is clear that choosing a vector $b = (b_1, \ldots, b_{n+1})$
with the constraint that $\sum_{i=1}^{n+1} b_i = M$ and $b_i \ge 0$ is
equivalent to defining a set of $y$ values (in other words,
a data set) which can be generated by some function in
the class $\mathcal{F}$. Specifically, the data values at the respective
sample points would be given by $y_1 = b_1$, $y_2 = b_1 + b_2$
and so on. We can define $\mathcal{F}_b$ to be the set of monotonic
functions in $\mathcal{F}$ which are consistent with these data
points. In fact, every $f \in \mathcal{F}$ would map onto some $b$,
and thus belong to some $\mathcal{F}_b$. Consequently,

$$\mathcal{F} = \bigcup_{\{b\,:\,b_i \ge 0\ \forall i,\ \sum_i b_i = M\}} \mathcal{F}_b$$

Given a target function $f \in \mathcal{F}_b$ and a choice of
$n$ points $x_i$, one can construct the data set $\mathcal{D} =
\{(x_i, f(x_i))\}_{i=1,\ldots,n}$ and the approximation scheme generates
an approximating function $h(\mathcal{D})$. It should be clear
that for an $L_1$ distance metric ($d(f, h) = \int_0^1 |f - h|\,dx$),
the following is true:

$$\sup_{f \in \mathcal{F}_b} d(f, h) = \frac{1}{2}\sum_{i=1}^{n+1} a_i b_i = \frac{1}{2}\, a \cdot b$$

Thus, taking the supremum over the entire class of
functions is equivalent to

$$\sup_{f \in \mathcal{F}} d(f, h(\mathcal{D})) = \sup_{\{b\,:\,b_i \ge 0,\ \sum_i b_i = M\}} \frac{1}{2}\, a \cdot b$$

The above is a straightforward linear programming
problem and yields as its solution the result
$\frac{1}{2} M \max\{a_i;\ i = 1, \ldots, (n+1)\}$.

Finally, every choice of $n$ points $x_i$, $i = 1, \ldots, n$ results
in a corresponding vector $a$ where $a_i \ge 0$ and $\sum_i a_i = 1$.
Thus minimizing the maximum error over all the choice
of sample points (according to eq. 2) is equivalent to

$$\arg\min_{x_1, \ldots, x_n}\ \sup_{f \in \mathcal{F}} d\left(f, h(\mathcal{D} = \{(x_i, f(x_i))\}_{i=1,\ldots,n})\right) = \arg\min_{\{a\,:\,a_i \ge 0,\ \sum_i a_i = 1\}} \max\{a_i;\ i = 1, \ldots, n+1\}$$

Clearly the solution of the above problem is $a_i = \frac{1}{n+1}$
for each $i$.
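The reduction above can be illustrated with a short sketch. A linear objective over the simplex $\{b : b_i \ge 0, \sum_i b_i = M\}$ is maximized at a vertex, which gives the closed form $\frac{1}{2} M \max_i a_i$ used below; the function names and the example point sets are assumptions made for illustration.

```python
def spacings(points):
    """Interval lengths a_1..a_{n+1} induced on [0, 1] by the interior sample points."""
    pts = [0.0] + sorted(points) + [1.0]
    return [hi - lo for lo, hi in zip(pts, pts[1:])]

def worst_case_l1_error(points, M):
    """sup over b >= 0 with sum(b) = M of (1/2) a.b, attained at a vertex b = M e_i."""
    return 0.5 * M * max(spacings(points))

if __name__ == "__main__":
    uniform = [0.2, 0.4, 0.6, 0.8]          # equally spaced, a_i = 1/5
    skewed = [0.1, 0.2, 0.3, 0.9]           # leaves one long interval
    print(worst_case_l1_error(uniform, M=1.0), worst_case_l1_error(skewed, M=1.0))
```

Any non-uniform choice leaves some interval longer than $1/(n+1)$, so its worst-case error exceeds that of the uniform grid, as the skewed example shows.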

In  other  words,  classical  optimal  recovery  suggests 
that  one  should  sample  the  function  uniformly.  Note 
that  this  is  not  an  adaptive  scheme.  In  the  next  section, 
we compare empirically the performance of three different
sampling schemes: the passive scheme, where one samples
randomly; the non-sequential “optimal” scheme, where one
samples uniformly; and the active scheme, which follows
our sequentially optimal strategy.

3.3.3  Error  Rates  and  Sample  Complexities  for 
some  Arbitrary  Functions:  Some 
Simulations 

In  this  section,  we  attempt  to  relate  the  number  of 
examples  drawn  and  error  made  by  the  learner  for  a 
variety  of  arbitrary  monotone  increasing  functions.  We 
begin  with  the  following  simulation: 

Simulation  A: 

1.  Pick  an  arbitrary  monotone-increasing  function. 

2. Decide N, the number of samples to be collected.
There are three methods of collection of samples.
The first is by randomly drawing N examples according
to a uniform distribution on [0, 1] (corresponding
to the passive case). The second is by
asking for function values on a uniform grid on [0, 1]
of grid spacing 1/N. The third is the CLA.

3.  The  three  learning  algorithms  differ  only  in  their 
method  of  obtaining  samples.  Once  the  samples  are 
obtained,  all  three  algorithms  attempt  to  approxi¬ 
mate  the  target  by  the  monotone  function  which  is 
the  linear  interpolant  of  the  samples. 

4. This entire process is now repeated for various values
of N for the same target function and then
repeated again for different target functions.
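The four steps of Simulation A can be sketched as follows. This is an illustrative reconstruction, not the code behind the reported figures. Assumptions made here: all three strategies also know $f(0)$ and $f(1)$; the uniform grid uses N interior points (spacing $1/(N+1)$) so every strategy makes the same number of queries; the CLA is run with a fixed budget of N queries and $p = 1$; the $L_1$ error is estimated by a midpoint rule; and `steep_sigmoid` is a hypothetical test target.

```python
import math
import random

def interpolate(samples, x):
    """Piece-wise linear interpolant through sorted (x, y) samples, evaluated at x."""
    for (x0, y0), (x1, y1) in zip(samples, samples[1:]):
        if x0 <= x <= x1:
            return y0 if x1 == x0 else y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside sampled range")

def l1_error(f, samples, grid=2000):
    """Midpoint-rule estimate of the L_1 distance between f and its interpolant."""
    pts = [(k + 0.5) / grid for k in range(grid)]
    return sum(abs(f(x) - interpolate(samples, x)) for x in pts) / grid

def sample_random(f, n, rng):
    """Passive case: n uniform random draws plus the two endpoints."""
    xs = [0.0, 1.0] + [rng.random() for _ in range(n)]
    return sorted((x, f(x)) for x in xs)

def sample_uniform(f, n):
    """n equally spaced interior points plus the two endpoints."""
    return [(k / (n + 1), f(k / (n + 1))) for k in range(n + 2)]

def sample_cla(f, n):
    """CLA with a fixed budget: n midpoint queries of the interval with largest E_i."""
    pts = [(0.0, f(0.0)), (1.0, f(1.0))]
    for _ in range(n):
        i = max(range(len(pts) - 1),
                key=lambda j: (pts[j + 1][0] - pts[j][0]) * abs(pts[j + 1][1] - pts[j][1]))
        mid = 0.5 * (pts[i][0] + pts[i + 1][0])
        pts.insert(i + 1, (mid, f(mid)))
    return pts

def steep_sigmoid(x):
    """Hypothetical monotone target: a smooth approximation of a step at 0.5."""
    return 1.0 / (1.0 + math.exp(-50.0 * (x - 0.5)))

if __name__ == "__main__":
    rng = random.Random(0)
    for n in (5, 10, 20, 40):
        print(n,
              l1_error(steep_sigmoid, sample_random(steep_sigmoid, n, rng)),
              l1_error(steep_sigmoid, sample_uniform(steep_sigmoid, n)),
              l1_error(steep_sigmoid, sample_cla(steep_sigmoid, n)))
```

For $p = 1$ the constant factor in $E_i$ is the same for every interval, so the sketch ranks intervals by the product of width and rise alone.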

Results:  Let  us  first  consider  performance  on  the  arbi¬ 
trarily  selected  monotonic  function  of  the  earlier  section. 
Shown in fig. 8 is the performance of the three different algorithms.
Notice that the active learning strategy (CLA)
has  the  lowest  error  rate.  On  an  average,  the  error  rate  of 
random  sampling  is  8  times  the  rate  of  CLA  and  uniform 
sampling  is  l.o  times  the  rate  of  CLA. 

Figure  9  shows  four  other  monotonic  functions  on 
which  we  ran  the  same  simulations  comparing  the  three 
sampling  strategies.  The  results  of  the  simulations  are 
shown  in  Fig.  10  and  Table  3.3.3.  Notice  that  the  ac¬ 
tive  strategy  (CLA)  far  outperforms  the  passive  strategy 
and  clearly  has  the  best  error  performance.  The  compar¬ 
ison  between  uniform  sampling  and  active  sampling  is 
more  interesting.  For  functions  like  function-2  (which  is 
a  smooth  approximation  of  a  step  function),  where  most 
of the “information” is located in a small region of the
domain,  CLA  outperforms  the  uniform  learner  by  a  large 
amount.  Functions  like  function-3  which  don’t  have  any 
clearly  identified  region  of  greater  information  have  the 
least  difference  between  CLA  and  the  uniform  learner  (as 
also  between  the  passive  and  active  learner).  Finally  on 
functions  which  lie  in  between  these  two  extremes  (like 
functions  4  and  5)  we  see  decreased  error-rates  due  to 
CLA which are in between the two extremes.

Figure 8: Error rates as a function of the number of
examples  for  the  arbitrary  monotone  function  shown  in 
a  previous  figure. 

In  conclusion,  the  active  learner  outperforms  the  pas¬ 
sive  learner.  Further,  it  is  even  better  than  classical 
optimal  recovery.  The  significant  advantage  of  the  ac¬ 
tive  learner  lies  in  its  adaptive  nature.  Thus,  for  certain 
“easy”  functions,  it  might  converge  very  rapidly.  For 
others,  it  might  take  as  long  as  classical  optimal  recov¬ 
ery,  though  never  more. 

4  Example  2:  A  Class  of  Functions 
with  Bounded  First  Derivative 

Here the class of functions we consider are from [0, 1] to
$\mathbb{R}$ and of the form

$$\mathcal{F} = \{ f \mid f(x) \text{ is differentiable and } |df/dx| \le d \}$$

Notice a few things about this class. First, there is no direct
bound on the values that functions in $\mathcal{F}$ can take. In
other words, for every $M > 0$, there exists some function
$f \in \mathcal{F}$ such that $f(x) > M$ for some $x \in [0, 1]$. However,
there is a bound on the first derivative, which means
that a particular function belonging to $\mathcal{F}$ cannot itself
change very sharply. Knowing the value of the function
at any point, we can bound the value of the function at
all other points. So for example, for every $f \in \mathcal{F}$, we see
that $|f(x)| \le dx + |f(0)| \le d + |f(0)|$.

We  observe  that  this  class  too  has  infinite  pseudo¬ 
dimension.  We  state  this  without  proof. 

Observation 3 The class $\mathcal{F}$ has infinite pseudo-dimension
in the sense of Pollard.

As  in  the  previous  example  we  w’ould  like  to  investi¬ 
gate  the  possibility  of  devising  active  learning  strategies 
for  this  class.  We  first  provide  a  lower  bound  on  the 
number  of  examples  a  learner  (whether  passive  or  ac¬ 
tive)  would  need  in  order  to  e  identify  this  class.  We 
then  derive  in  the  next  section,  an  optimal  active  learn¬ 
ing  strategy  (that  is,  an  instantiation  of  the  Active  Algo¬ 
rithm  B  earlier).  We  also  provide  an  upper  bound  on  the 
number  of  examples  this  active  algorithm  would  take. 
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Figure  9;  Four  other  monotonic  functions  on  which  simulations  have  been  run  comparing  random,  uniform,  and 
active  sampling  strategies. 


Figure 10: This figure plots the log of the error ($L_1$ error) against $N$, the number of examples, for each of the 4 monotonic functions shown in fig. 9. The solid line represents error rates for random sampling, the line with short dashes represents uniform sampling, and the line with long dashes represents results for CLA. Notice how CLA beats random sampling by large amounts and does slightly better than uniform sampling.


Fn. No.    Avg. Random/CLA    Avg. Uniform/CLA
1          7.23               1.66
2          61.37              10.91
3          6.67               1.10
4          8.07               1.62
5          6.62               1.56

Table 1: Shown in this table is the average error rate of the random sampling and the uniform sampling strategies, expressed as a multiple of the error rates due to CLA. Thus for function 3, for example, uniform error rates are on average 1.1 times CLA error rates. The averages are taken over the different values of $N$ (number of examples) for which the simulations have been done. Note that this is not a very meaningful average, as the difference in the error rates between the various strategies grows with $N$ (as can be seen from the curves) if there is a difference in the order of the sample complexity. However, the averages have been provided just to give a feel for the numbers.


We also need to specify some other terms for this class of functions. The approximation scheme $\mathcal{H}$ is a first order spline as before, the domain $D = [0, 1]$ is partitioned into intervals by the data $[x_i, x_{i+1}]$ (again as before), and the metric $d_C$ is an $L_1$ metric given by $d_C(f_1, f_2) = \int_C |f_1(x) - f_2(x)|\,dx$. The results in this section can be extended to an $L_p$ norm but we confine ourselves to an $L_1$ metric for simplicity of presentation.

4.1  Lower  Bounds 

Theorem 4 Any learning algorithm (whether passive or active) has to draw at least $\Omega(d/\epsilon)$ examples (whether randomly or by choosing) in order to PAC learn the class $\mathcal{F}$.


Proof Sketch: Let us assume that the learner collects $m$ examples (passively by drawing according to some distribution, or actively by any other means). We show that an adversary can force the learner to make an error of at least $\epsilon$ if it draws fewer than $\Omega(d/\epsilon)$ examples. This is how the adversary functions.

At each of the $m$ points collected by the learner, the adversary claims the function has value 0. Thus the learner is reduced to coming up with a hypothesis that belongs to $\mathcal{F}$ and which it claims will be within $\epsilon$ of the target function. Now we need to show that whatever the function hypothesized by the learner, the adversary can always come up with some other function, also belonging to $\mathcal{F}$ and agreeing with all the data points, which is more than an $\epsilon$ distance away from the learner's hypothesis. In this way, the learner will be forced to make an error greater than $\epsilon$.

The $m$ points drawn by the learner divide the region $[0, 1]$ into (at most) $m + 1$ different intervals. Let the lengths of these intervals be $b_1, b_2, b_3, \ldots, b_{m+1}$. The "true" function, or in other words, the function the adversary will present, should have value 0 at the endpoints of each of the above intervals. We first state the following lemma.

Lemma 2 There exists a function $f \in \mathcal{F}$ such that $f$ interpolates the data and

$$\int_{[0,1]} |f|\,dx > \frac{kd}{4(m+1)}$$

where $k$ is a constant arbitrarily close to 1.

Proof: Consider fig. 11. The function $f$ is indicated by the dark line. As is shown, $f$ changes sign at each $x = x_i$. Without loss of generality, we consider an interval $[x_i, x_{i+1}]$ of length $b_i$. Let the midpoint of this interval be $z = (x_i + x_{i+1})/2$. The function here has the values

$$f(x) = \begin{cases} d(x - x_i) & \text{for } x \in [x_i, z - \alpha] \\ -d(x - x_{i+1}) & \text{for } x \in [z + \alpha, x_{i+1}] \\ \ldots & \text{for } x \in [z - \alpha, z + \alpha] \end{cases}$$

Simple algebra shows that

$$\int_{x_i}^{x_{i+1}} |f|\,dx > d(b_i^2 - \alpha^2)/4$$

Clearly, $\alpha$ can be chosen small, so that

$$\int_{x_i}^{x_{i+1}} |f|\,dx > \frac{kd\,b_i^2}{4}$$

where $k$ is as close to 1 as we want. By combining the different pieces of the function we see that

$$\int_{[0,1]} |f|\,dx > \frac{kd}{4} \sum_i b_i^2$$

Figure 11: Construction of a function satisfying Lemma 2.


Now  we  make  use  of  the  following  lemma, 

Lemma 3 For a set of numbers $b_1, \ldots, b_m$ such that $b_1 + b_2 + \ldots + b_m = 1$, the following inequality is true:

$$b_1^2 + b_2^2 + \ldots + b_m^2 \ge 1/m$$

Proof: By induction. □

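Applying Lemma 3 to the (at most) $m + 1$ interval lengths gives $\sum_i b_i^2 \ge 1/(m+1)$, which establishes Lemma 2. Now it is easy to see how the adversary functions. Suppose the learner postulates that the true function is $h$. Let $\int_0^1 |h|\,dx = \chi$. If $\chi > \epsilon$, the adversary claims that the true function was $f = 0$. In that case $\int_0^1 |h - f|\,dx = \chi > \epsilon$. If on the other hand $\chi \le \epsilon$, then the adversary claims that the true function was $f$ (as above). In that case,

$$\int_0^1 |f - h|\,dx \ge \int_0^1 |f|\,dx - \int_0^1 |h|\,dx > \frac{kd}{4(m+1)} - \chi$$

Clearly, if $m$ is less than $\frac{kd}{8\epsilon} - 1$, the learner is forced again to make an error greater than $\epsilon$. Thus in either case, the learner is forced to make an error greater than or equal to $\epsilon$ if fewer than $\Omega(d/\epsilon)$ examples are collected (howsoever these examples are collected). □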
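The adversary's construction can be checked numerically with a minimal sketch. It assumes the limiting case $k \to 1$ (a sharp tent on every interval, which is the limit of the smoothed functions of Lemma 2 rather than a member of the class itself), and the function names are illustrative, not taken from the paper.

```python
import random


def interval_lengths(samples):
    """Lengths of the intervals into which sample points in [0, 1] cut [0, 1]."""
    cuts = sorted({0.0, 1.0, *samples})
    return [right - left for left, right in zip(cuts, cuts[1:])]


def tent_adversary_norm(samples, d):
    """L1 norm of the limiting tent adversary (assumption: k -> 1).

    On an interval of length b the tent rises with slope d to height d*b/2,
    so it contributes area d*b**2/4.
    """
    return d / 4.0 * sum(b * b for b in interval_lengths(samples))


def lemma2_bound(m, d):
    """The value d / (4 (m + 1)) from Lemma 2 with k = 1."""
    return d / (4.0 * (m + 1))


if __name__ == "__main__":
    rng = random.Random(0)
    d, m = 10.0, 20
    samples = [rng.random() for _ in range(m)]
    print(tent_adversary_norm(samples, d), lemma2_bound(m, d))
```

Equally spaced samples make the two printed quantities coincide, which is why no placement of $m$ points lets the learner push the adversary's norm below $d/(4(m+1))$.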

The  previous  result  holds  for  all  learning  algorithms. 
It  is  possible  to  show  the  following  result  for  a  passive 
learner. 

Theorem 5 A passive learner must draw at least $\max(\Omega(d/\epsilon), \sqrt{d/\epsilon}\,\ln(1/\delta))$ examples to learn this class.

Proof Sketch: The $d/\epsilon$ term in the lower bound follows directly from the previous theorem. We show how the second term is obtained.

Consider the uniform distribution on $[0, 1]$ and a subclass of functions which have value 0 on the region $A = [0, 1 - \alpha]$ and belong to $\mathcal{F}$. Suppose the passive learner draws $l$ examples uniformly at random. Then with probability $(1 - \alpha)^l$, all these examples will be drawn from region $A$. It only remains to show that for this event, and the subclass considered, whatever be the function hypothesized by the learner, an adversary can force it to make a large error.

It is easy to show (using the arguments of the earlier theorem) that there exists a function $f \in \mathcal{F}$ such that $f$ is 0 on $A$ and $\int_0^1 |f|\,dx = \frac{1}{2}\alpha^2 d$. This is equal to $2\epsilon$ if $\alpha = \sqrt{4\epsilon/d}$. Now let the learner's hypothesis be $h$. Let $\int_0^1 |h|\,dx = \chi$. If $\chi$ is greater than $\epsilon$, the adversary claims the target was $g = 0$. Otherwise, the adversary claims the target was $g = f$. In either case, $\int_0^1 |g - h|\,dx \ge \epsilon$.

It is possible to show (by an identical argument to the proof of theorem 1) that unless $l > \frac{1}{2}\sqrt{d/\epsilon}\,\ln(1/\delta)$, all examples will be drawn from $A$ with probability greater than $\delta$ and the learner will be forced to make an error greater than $\epsilon$. Thus the second term appears, indicating the dependence on $\delta$ in the lower bound. □

4.2  Active  Learning  Algorithms 

We now derive in this section an algorithm which actively selects new examples on the basis of information gathered from previous examples. This illustrates how our formulation of section 3.1.1 can be used in this case to effectively obtain an optimal adaptive sampling strategy.

4.2.1  Derivation  of  an  optimal  sampling 
strategy 

Fig. 12 shows an arbitrary data set containing information about some unknown target function. Since the target is known to have a first derivative bounded by $d$, it is clear that the target is constrained to lie within the parallelograms shown in the figure. The slopes of the lines making up the parallelogram are $d$ and $-d$ appropriately. Thus, $\mathcal{F}_{\mathcal{D}}$ consists of all functions which lie within the parallelograms and interpolate the data set. We can now compute the uncertainty of the approximation scheme over any interval $C$ (given by $e_C(\mathcal{H}, \mathcal{D}, \mathcal{F})$) for this case. Recall that the approximation scheme $\mathcal{H}$ is a first order spline, and the data $\mathcal{D}$ consists of $(x, y)$ pairs. Fig. 13 shows the situation for a particular interval $C_i = [x_i, x_{i+1}]$. Here $i$ ranges from 0 to $n$. As in the previous example, we let $x_0 = 0$ and $x_{n+1} = 1$.
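The parallelogram constraint can be made concrete with a small sketch. The function name `envelope` and the list-based data layout are illustrative assumptions, not part of the original formulation; the code evaluates the tightest lower and upper bounds that the derivative bound $d$ and the data impose at a query point.

```python
def envelope(x, xs, ys, d):
    """Return (lower, upper) bounds on f(x) for any f with |f'| <= d
    that interpolates the data points (xs[i], ys[i]).

    Each data point confines f(x) to a cone of slopes +d and -d around it;
    intersecting the cones yields the parallelograms between neighbours.
    """
    upper = min(y + d * abs(x - xi) for xi, y in zip(xs, ys))
    lower = max(y - d * abs(x - xi) for xi, y in zip(xs, ys))
    return lower, upper


if __name__ == "__main__":
    # Equal values at both ends: the widest possible parallelogram.
    print(envelope(0.5, [0.0, 1.0], [0.0, 0.0], d=10.0))
    # Values whose slope already equals d: no freedom is left.
    print(envelope(0.5, [0.0, 1.0], [0.0, 10.0], d=10.0))
```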

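The maximum error the approximation scheme $\mathcal{H}$ could have on this interval is given by half the area of the parallelogram:

$$e_{C_i}(\mathcal{H}, \mathcal{D}, \mathcal{F}) = \sup_{f \in \mathcal{F}_{\mathcal{D}}} \int_{C_i} |h - f|\,dx = \frac{1}{4d}\left(d^2 B_i^2 - A_i^2\right)$$

where $A_i = |f(x_{i+1}) - f(x_i)|$ and $B_i = x_{i+1} - x_i$. Clearly, the maximum error the approximation scheme could have over the entire domain is given by

$$e_{D=[0,1]}(\mathcal{H}, \mathcal{D}, \mathcal{F}) = \sup_{f \in \mathcal{F}_{\mathcal{D}}} \sum_{j=0}^{n} \int_{C_j} |f - h|\,dx = \sum_{j=0}^{n} e_{C_j} \qquad (5)$$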
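As a minimal sketch of these two formulas (the function names are hypothetical, and the data are assumed to be sorted by $x$ and consistent with the derivative bound), the per-interval and total uncertainties can be computed directly from the samples:

```python
def interval_uncertainty(x_left, y_left, x_right, y_right, d):
    """Half the parallelogram area: (d^2 B^2 - A^2) / (4 d)."""
    B = x_right - x_left
    A = abs(y_right - y_left)
    return (d * d * B * B - A * A) / (4.0 * d)


def total_uncertainty(xs, ys, d):
    """Sum of the interval uncertainties over consecutive samples (eq. 5)."""
    return sum(
        interval_uncertainty(xs[i], ys[i], xs[i + 1], ys[i + 1], d)
        for i in range(len(xs) - 1)
    )


if __name__ == "__main__":
    print(total_uncertainty([0.0, 1.0], [0.0, 0.0], d=10.0))
    print(total_uncertainty([0.0, 0.5, 1.0], [0.0, 0.0, 0.0], d=10.0))
```

The uncertainty vanishes when the observed change $A$ equals $dB$, since the target is then forced to be the straight line joining the two samples.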

The computation of $e_C$ is crucial to the derivation of the active sampling strategy. Now imagine that we chose to sample at a point $x$ in the interval $C_i$ and received a value $y$ (belonging to $\mathcal{F}_{\mathcal{D}}(x)$). This adds one more interval and divides $C_i$ into two intervals $C_{i1}$ and $C_{i2}$ as shown in fig. 14. We also obtain two correspondingly smaller parallelograms within which the target function is now constrained to lie.

The addition of this new data point to the data set ($\mathcal{D}' = \mathcal{D} \cup (x, y)$) requires us to recompute the learner's hypothesis (shown in fig. 14). Correspondingly, it also requires us to update $e_C$, i.e., we now need to compute $e_C(\mathcal{H}, \mathcal{D}', \mathcal{F})$. First we observe that the addition of the new data point does not affect the uncertainty measure on any interval other than the divided interval $C_i$. This is clear when we notice that the parallelograms (whose area denotes the uncertainty on each interval) for all the other intervals are unaffected by the new data point.

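Thus,

$$e_{C_j}(\mathcal{H}, \mathcal{D}', \mathcal{F}) = e_{C_j}(\mathcal{H}, \mathcal{D}, \mathcal{F}) = \frac{1}{4d}\left(d^2 B_j^2 - A_j^2\right) \quad \text{for } j \ne i$$

For the $i$th interval $C_i$, the total uncertainty is now recomputed as (half the sum of the two parallelograms in fig. 14)

$$e_{C_i}(\mathcal{H}, \mathcal{D}', \mathcal{F}) = \frac{1}{4d}\left(\left(d^2 u^2 - v^2\right) + \left(d^2 (B_i - u)^2 - (A_i - v)^2\right)\right) \qquad (6)$$

$$= \frac{1}{4d}\left(\left(d^2 u^2 + d^2 (B_i - u)^2\right) - \left(v^2 + (A_i - v)^2\right)\right)$$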

where $u = x - x_i$, $v = y - y_i$, and $A_i$ and $B_i$ are as before. Note that $u$ ranges from 0 to $B_i$, for $x_i \le x \le x_{i+1}$. However, given a particular choice of $x$ (this fixes a value of $u$), the possible values $v$ can take are constrained by the geometry of the parallelogram. In particular, $v$ can only lie within the parallelogram. For a particular $x$, we know that $\mathcal{F}_{\mathcal{D}}(x)$ represents the set of all possible $y$ values we can receive. Since $v = y - y_i$, it is clear that $v \in \mathcal{F}_{\mathcal{D}}(x) - y_i$. Naturally, if $y < y_i$, we find that $v < 0$ and $A_i - v > A_i$. Similarly, if $y > y_{i+1}$, we find that $v > A_i$.

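We now prove the following lemma.

Lemma 4 The following two identities are valid for the appropriate mini-max problem.

Identity 1:

$$\frac{B}{2} = \arg\min_{u \in [0, B]} \sup_{v \in \{\mathcal{F}_{\mathcal{D}}(x) - y_i\}} H_1(u, v)$$

where $H_1(u, v) = \left(d^2 u^2 + d^2 (B - u)^2\right) - \left(v^2 + (A - v)^2\right)$.

Identity 2:

$$\frac{1}{2}\left(d^2 B^2 - A^2\right) = \min_{u \in [0, B]} \sup_{v \in \{\mathcal{F}_{\mathcal{D}}(x) - y_i\}} H_2(u, v)$$

where $H_2(u, v) = \left(d^2 u^2 + d^2 (B - u)^2\right) - \left(v^2 + (A - v)^2\right)$.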

Proof: The expression on the right is a difference of two quadratic expressions and can be expressed as $q_1(u) - q_2(v)$. For a particular $u$, the expression is maximized when the quadratic $q_2(v) = v^2 + (A - v)^2$ is minimized. Observe that this quadratic is globally minimized at $v = A/2$. We need to perform this minimization over the set $v \in \mathcal{F}_{\mathcal{D}}(x) - y_i$ (this is the set of values which lie within the upper and lower boundaries of the parallelogram shown in fig. 15). There are three cases to consider.

Case I: $u \in [A/2d, B - A/2d]$

First, notice that for $u$ in this range, it is easy to verify that the upper boundary of the parallelogram is greater than $A/2$ while the lower boundary is less than $A/2$. Thus we can find a value of $v$ (viz. $v = A/2$) which globally minimizes this quadratic because $A/2 \in \mathcal{F}_{\mathcal{D}}(x) - y_i$. The expression thus reduces to $d^2 u^2 + d^2 (B - u)^2 - A^2/2$. Over the interval for $u$ considered in this case, it is minimized at $u = B/2$, resulting in the value

$$\left(d^2 B^2 - A^2\right)/2$$

Case II: $u \in [0, A/2d]$

In this case, the upper boundary of the parallelogram (which is the maximum value $v$ can take) is less than $A/2$ and hence $q_2(v)$ is minimized when $v = du$. The total expression then reduces to

$$d^2 u^2 + d^2 (B - u)^2 - \left((du)^2 + (A - du)^2\right)$$

$$= d^2 (B - u)^2 - (A - du)^2 = \left(d^2 B^2 - A^2\right) - 2ud(dB - A)$$

Since $dB \ge A$, the above is minimized on this interval by choosing $u = A/2d$, resulting in the value

$$dB(dB - A)$$

Case III: $u \in [B - A/2d, B]$. By symmetry, this reduces to case II.

Since $(d^2 B^2 - A^2)/2 \le dB(dB - A)$ (this is easily seen by completing squares), it follows that $u = B/2$ is the global solution of the mini-max problem above. Further, we have shown that for this value of $u$, the sup term reduces to $(d^2 B^2 - A^2)/2$ and the lemma is proved. □
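The lemma can also be checked numerically. The following is a rough sketch under stated assumptions: the helper names are hypothetical, $A \le dB$ is assumed, and the feasible range of $v$ is taken to be the vertical slice of the parallelogram at offset $u$, with $y_i$ as the origin and the right endpoint at height $A$.

```python
def feasible_v(u, A, B, d):
    """Vertical slice of the parallelogram at offset u from the left endpoint."""
    lower = max(-d * u, A - d * (B - u))
    upper = min(d * u, A + d * (B - u))
    return lower, upper


def sup_H(u, A, B, d):
    """sup over feasible v of (d^2 u^2 + d^2 (B-u)^2) - (v^2 + (A-v)^2).

    The subtracted quadratic is convex in v with minimum at A/2, so the
    supremum is attained by clamping A/2 to the feasible range.
    """
    lower, upper = feasible_v(u, A, B, d)
    v = min(max(A / 2.0, lower), upper)
    return d * d * (u * u + (B - u) ** 2) - (v * v + (A - v) ** 2)


def minimax_on_grid(A, B, d, steps=1000):
    """Grid search for the u in [0, B] minimizing sup_H."""
    grid = [B * k / steps for k in range(steps + 1)]
    best_u = min(grid, key=lambda u: sup_H(u, A, B, d))
    return best_u, sup_H(best_u, A, B, d)


if __name__ == "__main__":
    print(minimax_on_grid(A=2.0, B=1.0, d=10.0))
```

For the illustrative values $A = 2$, $B = 1$, $d = 10$ the search should return the midpoint $u = 0.5$ with value $(d^2 B^2 - A^2)/2 = 48$, while the endpoint $u = 0$ gives the unsplit value $d^2 B^2 - A^2 = 96$.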

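Using the above lemma along with eq. 6, we see that

$$\min_{x \in C_i} \sup_{y \in \mathcal{F}_{\mathcal{D}}(x)} e_{C_i}(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F}) = \frac{1}{8d}\left(d^2 B_i^2 - A_i^2\right) = \frac{1}{2}\, e_{C_i}(\mathcal{H}, \mathcal{D}, \mathcal{F})$$

In other words, by sampling the midpoint of the interval $C_i$, we are guaranteed to reduce the uncertainty by 1/2. As in the case of monotonic functions, we see using eq. 5 that we should sample the midpoint of the interval with largest uncertainty $e_{C_i}(\mathcal{H}, \mathcal{D}, \mathcal{F})$ to obtain the global solution, in accordance with the principle of Algorithm B of section 2.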

This  allows  us  to  formally  state  an  active  learning 
algorithm  which  is  optimal  in  the  sense  implied  in  our 
formulation. 

The Choose and Learn Algorithm - 2 (CLA-2)

1. [Initial Step] Ask for values of the function at points $x = 0$ and $x = 1$. At this stage, the domain $D = [0, 1]$ is composed of one interval only, viz., $C_1 = [0, 1]$. Compute $e_{C_1} = \frac{1}{4d}\left(d^2 - |f(1) - f(0)|^2\right)$ and $e_D = e_{C_1}$. If $e_D < \epsilon$, stop and output the linear interpolant of the samples as the hypothesis; otherwise query the midpoint of the interval to get a partition of the domain into two subintervals $[0, 1/2)$ and $[1/2, 1]$.

2. [General Update and Stopping Rule] In general, at the $k$th stage, suppose that our partition of the interval $[0, 1]$ is $[x_0 = 0, x_1), [x_1, x_2), \ldots, [x_{k-1}, x_k = 1]$. We compute the uncertainty $e_{C_i} = \frac{1}{4d}\left(d^2 (x_i - x_{i-1})^2 - |y_i - y_{i-1}|^2\right)$ for each $i = 1, \ldots, k$. The midpoint of the interval with maximum $e_{C_i}$ is queried for the next sample. The total error $e_D = \sum_{i=1}^{k} e_{C_i}$ is computed at each stage and the process is terminated when $e_D < \epsilon$. Our hypothesis $h$ at every stage is a linear interpolation of all the points sampled so far and our final hypothesis is obtained upon the termination of the whole process.
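The two steps above can be sketched in a few lines. This is an illustrative reading of the algorithm rather than the authors' code: the names `cla2` and `interpolate`, the `max_queries` safety cap, and the tie-breaking rule (the leftmost interval of maximum uncertainty) are assumptions, and the target `f` is assumed to satisfy the derivative bound.

```python
import bisect


def interval_uncertainty(x0, y0, x1, y1, d):
    """(d^2 B^2 - A^2) / (4 d) for the interval [x0, x1]."""
    return (d * d * (x1 - x0) ** 2 - (y1 - y0) ** 2) / (4.0 * d)


def cla2(f, d, eps, max_queries=100000):
    """Query f at 0 and 1, then keep bisecting the interval of largest
    uncertainty until the total uncertainty drops below eps."""
    xs = [0.0, 1.0]
    ys = [f(0.0), f(1.0)]
    while True:
        unc = [
            interval_uncertainty(xs[i], ys[i], xs[i + 1], ys[i + 1], d)
            for i in range(len(xs) - 1)
        ]
        total = sum(unc)
        if total < eps or len(xs) >= max_queries:
            return xs, ys, total
        i = max(range(len(unc)), key=unc.__getitem__)
        mid = 0.5 * (xs[i] + xs[i + 1])
        xs.insert(i + 1, mid)
        ys.insert(i + 1, f(mid))


def interpolate(xs, ys, x):
    """Evaluate the linear interpolant (the hypothesis h) at x in [0, 1]."""
    i = min(max(bisect.bisect_right(xs, x) - 1, 0), len(xs) - 2)
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


if __name__ == "__main__":
    xs, ys, total = cla2(lambda x: 0.0, d=10.0, eps=0.5)
    print(len(xs), total)
```

A target that rises with slope exactly $d$ is identified from the two initial queries alone, whereas a constant target keeps every interval maximally uncertain and needs the most queries.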

It is possible to show that the following upper bound exists on the number of examples CLA-2 would take to learn the class of functions under consideration.

Theorem 6 The CLA-2 would PAC learn the class in at most $\frac{d}{4\epsilon} + 1$ examples.

Proof Sketch: Following a strategy similar to the proof of Theorem 3, we show how a slight variant of CLA-2 would converge in at most $(d/4\epsilon + 1)$ examples. Imagine a grid of $n$ points placed $1/(n - 1)$ apart on the domain $D = [0, 1]$, where the $k$th point is $k/(n - 1)$ (for $k$ going from 0 to $n - 1$). The variant of CLA-2 operates by confining its queries to points on this grid. Thus at the $k$th stage, instead of querying the midpoint of the interval with maximum uncertainty, it will query the gridpoint closest to this midpoint. Suppose it uses up all the gridpoints in this fashion; then there will be $n - 1$ intervals and, by our arguments above, the maximum error on each interval is bounded by

$$\frac{1}{4d}\left(d^2 \left(\frac{1}{n-1}\right)^2 - |y_i - y_{i-1}|^2\right) \le \frac{d}{4}\left(\frac{1}{n-1}\right)^2$$

Since there are $n - 1$ such intervals, the total error it could make is bounded by

$$(n - 1)\,\frac{d}{4}\left(\frac{1}{n-1}\right)^2 = \frac{d}{4(n-1)}$$

It is easy to show that for $n > d/4\epsilon + 1$, this maximum error is less than $\epsilon$. Thus the learner need not collect any more than $d/4\epsilon + 1$ examples to learn the target function to within an $\epsilon$ accuracy. Note that the learner will have identified the target to $\epsilon$ accuracy with probability 1 (always) by following the strategy outlined in this variant of CLA-2. □

We now have both an upper and lower bound for PAC-learning the class (under a uniform distribution) with queries. Notice that here as well, the sample complexity of active learning does not depend upon the confidence parameter $\delta$. Thus for $\delta$ arbitrarily small, the difference in sample complexities between passive and active learning becomes arbitrarily large, with active learning requiring far fewer examples.

4.3 Some Simulations

We now provide some simulations conducted on arbitrary functions of the class of functions with bounded derivative (the class $\mathcal{F}$). Fig. 16 shows 4 arbitrarily selected functions which were chosen to be the target function for the approximation scheme considered. In particular, we are interested in observing how the active strategy samples the target function for each case. Further, we are interested in comparing the active and passive techniques with respect to error rates for the same number of examples drawn. In this case, we have been unable to derive an analytical solution to the classical optimal recovery problem. Hence, we do not compare it as an alternative sampling strategy in our simulations.

4.3.1 Distribution of points selected

The active algorithm CLA-2 selects points adaptively on the basis of previous examples received. Thus the distribution of the sample points in the domain $D$ of the function depends inherently upon the arbitrary target function. Consider, for example, the distribution of points when the target function is chosen to be Function-1 of the set shown in fig. 16.

Notice (as shown in fig. 17) that the algorithm chooses to sample densely in places where the target is flat, and less densely where the function has a steep slope. As our mathematical analysis of the earlier section showed, this is well founded. Roughly speaking, if the function has the same value at $x_i$ and $x_{i+1}$, then it could have a variety of values (wiggle a lot) within. However, if $f(x_{i+1})$ is much greater (or less) than $f(x_i)$, then, in view of the bound $d$ on how fast it can change, it would have had to increase (or decrease) steadily over the interval. In the second case, the rate of change of the function over the interval is high, there is less uncertainty in the values of the function within the interval, and consequently fewer samples are needed in between.

In example 1, for the case of monotone functions, we saw that the density of sample points was proportional to the first derivative of the target function. By contrast, in this example, the optimal strategy chooses to sample points in a way which is inversely proportional to the magnitude of the first derivative of the target function. Fig. 18 exemplifies this.

4.3.2 Error Rates

In an attempt to relate the number of examples drawn and the error made by the learner, we performed the following simulation.

Simulation B:

1. Pick an arbitrary function from class $\mathcal{F}$.

2. Decide $N$, the number of samples to be collected. There are two methods of collection of samples. The first (passive) is by randomly drawing $N$ examples according to a uniform distribution on $[0, 1]$. The second (active) is the CLA-2.

3. The two learning algorithms differ only in their method of obtaining samples. Once the samples are obtained, both algorithms attempt to approximate the target by the linear interpolant of the samples (first order splines).

4. This entire process is now repeated for various values of $N$ for the same target function and then repeated again for the four different target functions of fig. 16.
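The simulation steps can be sketched as follows. Several details are assumptions made only for illustration: the target $\sin(10x)$ with $d = 10$ is a hypothetical stand-in for the functions of fig. 16, the active learner runs midpoint queries for a fixed budget of $N$ samples instead of stopping at a prescribed $\epsilon$, the passive interpolant is held constant outside the range of its samples, and the $L_1$ error is estimated with a midpoint rule.

```python
import bisect
import math
import random


def interpolate(xs, ys, x):
    """Linear interpolant of the samples, held constant outside their range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, x) - 1
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


def l1_error(f, xs, ys, cells=2000):
    """Midpoint-rule estimate of the L1 distance between f and the interpolant."""
    h = 1.0 / cells
    mids = ((k + 0.5) * h for k in range(cells))
    return h * sum(abs(f(x) - interpolate(xs, ys, x)) for x in mids)


def passive_samples(f, n, rng):
    """n points drawn uniformly at random on [0, 1]."""
    xs = sorted(rng.random() for _ in range(n))
    return xs, [f(x) for x in xs]


def scaled_uncertainties(xs, ys, d):
    """d^2 B^2 - A^2 per interval, i.e. 4d times the interval uncertainty."""
    return [d * d * (xs[i + 1] - xs[i]) ** 2 - (ys[i + 1] - ys[i]) ** 2
            for i in range(len(xs) - 1)]


def active_samples(f, n, d):
    """Midpoint queries in the interval of largest uncertainty, n >= 2 samples."""
    xs, ys = [0.0, 1.0], [f(0.0), f(1.0)]
    while len(xs) < n:
        unc = scaled_uncertainties(xs, ys, d)
        i = max(range(len(unc)), key=unc.__getitem__)
        mid = 0.5 * (xs[i] + xs[i + 1])
        xs.insert(i + 1, mid)
        ys.insert(i + 1, f(mid))
    return xs, ys


def certified_bound(xs, ys, d):
    """Total uncertainty e_D, an upper bound on the true L1 error."""
    return sum(scaled_uncertainties(xs, ys, d)) / (4.0 * d)


def target(x):
    """Hypothetical test function with |f'(x)| = |10 cos(10 x)| <= 10."""
    return math.sin(10.0 * x)


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 20, 40, 80):
        passive = l1_error(target, *passive_samples(target, n, rng))
        active = l1_error(target, *active_samples(target, n, 10.0))
        print(n, passive, active)
```

Printing the two error columns for increasing $N$ gives a rough analogue of the curves in fig. 19; the numbers depend on the chosen target and random seed.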

The results are shown in fig. 19. Notice how the active learner outperforms the passive learner. For the same number of examples, the active scheme, having chosen its examples optimally by our algorithm, makes less error.

We have obtained in theorem 6 an upper bound on the performance of the active learner. However, as we have already remarked earlier, the number of examples the active algorithm takes before stopping (i.e., outputting an $\epsilon$-good approximation) varies and depends upon the nature of the target function. "Simple" functions are learned quickly, "difficult" functions are learned slowly. As a point of interest, we have shown in fig. 20 how the actual number of examples drawn varies with $\epsilon$. In order to learn a target function to $\epsilon$-accuracy, CLA-2 needs at most $n_{max}(\epsilon) = d/4\epsilon + 1$ examples. However, for a particular target function $f$, let the number of examples it actually requires be $n_f(\epsilon)$. We plot $n_f(\epsilon)/n_{max}(\epsilon)$ as a function of $\epsilon$. Notice, first, that this ratio is always much less than 1. In other words, the active learner stops before the worst case upper bound with a guaranteed $\epsilon$-good hypothesis. This is the significant advantage of an adaptive sampling scheme. Recall that for uniform sampling (or even classical optimal recovery) we would have no choice but to ask for $d/4\epsilon$ examples to be sure of having an $\epsilon$-good hypothesis. Further, notice that as $\epsilon$ gets smaller, the ratio gets smaller. This suggests that for these functions, the sample complexity of the active learner is of a different order (smaller) than the worst case bound. Of course, there always exists some function in $\mathcal{F}$ which would force the active learner to perform at its worst case sample complexity level.
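The worst-case quantities referred to here follow from the uniform-grid bound $d/(4(n-1))$ used in the proof of theorem 6. The helper names below are hypothetical, and the sketch only restates that arithmetic.

```python
import math


def n_max(d, eps):
    """Worst-case example count d / (4 eps) + 1."""
    return d / (4.0 * eps) + 1.0


def uniform_grid_error_bound(d, n):
    """Worst-case total L1 error with n equally spaced samples on [0, 1]."""
    return d / (4.0 * (n - 1))


def smallest_sufficient_grid(d, eps):
    """Smallest integer n for which the grid bound is strictly below eps."""
    return math.floor(n_max(d, eps)) + 1


if __name__ == "__main__":
    d, eps = 10.0, 0.3
    n = smallest_sufficient_grid(d, eps)
    print(n_max(d, eps), n, uniform_grid_error_bound(d, n))
```

Dividing the number of queries an adaptive run actually uses by `n_max(d, eps)` gives the ratio discussed above.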
Figure 16: Four functions with bounded derivative considered in the simulations. The uniform bound on the derivative was chosen to be $d = 10$.
Figure 19: Results of Simulation B. Notice how the sampling strategy of the active learner causes better approximation (lower error rates) for the same number of examples.


Figure  20:  Variation  with  epsilons. 


5  Conclusions,  Extensions,  and  Open 
Problems 

This paper focused on the possibility of devising active strategies to collect data for the problem of approximating real-valued function classes. We were able to derive a sequential version of optimal recovery. This sequential version, by virtue of using partial information about the target function, is superior to classical optimal recovery. This provided us with a general formulation of an adaptive sampling strategy, which we then demonstrated on two example cases. Theoretical and empirical bounds on the sample complexity of passive and active learning for these cases suggest the superiority of the active scheme as far as the number of examples needed is concerned. It is worthwhile to observe that the same general framework gave rise to completely different sampling schemes in the two examples we considered. In one, the learner sampled densely in regions of high change. In the other, the learner did the precise reverse. This should lead us to further appreciate the fact that active sampling strategies are very task-dependent.

Using the same general formulation, we were also able to devise active strategies (again with superior sample complexity gain) for the following concept classes. 1) For the class of indicator functions $\{1_{[a,b]} : 0 < a < b < 1\}$ on the interval $[0, 1]$, the sample complexity is reduced from $(1/\epsilon)\ln(1/\delta)$ for passive learning to $\ln(1/\epsilon)$ by adding membership queries. 2) For the class of half-spaces on a regular $n$-simplex, the sample complexity is reduced from $(n/\epsilon)\ln(1/\delta)$ to $n^2 \ln(s/\epsilon)$ by adding membership queries. [...] this class by Eisenberg (1992) using a different framework.

There are several directions for further research. First, one could consider the possibility of adding noise to our formulation of the problem. Noisy versions of optimal recovery exist and this might not be conceptually a very difficult problem. Although the general formulation (at least in the noise-free case) is complete, it might not be possible to compute the uncertainty bounds $e_C$ for a variety of function classes. Without this, one could not actually use this paradigm to obtain a specific algorithm. A natural direction to pursue would be to investigate other classes (especially in more dimensions than 1) and other distance metrics to obtain further specific results. We observed that the active learning algorithm lay between classical optimal recovery and the optimal teacher. It would be interesting to compare the exact differences in a more principled way. In particular, an interesting open question is whether the sampling strategy of the active learner converges to that of the optimal teacher as more and more information becomes available. It would not be unreasonable to expect this, though precise results are lacking. In general, on the theme of better characterizing the conditions under which active learning would vastly outperform passive learning for function approximation, much work remains to be done. While active learning might require fewer examples to learn the target function, its computational burden is significantly larger. It is necessary to explore the information/computation trade-off with active learning schemes. Finally, we should note that we have adopted in this part a model of learning motivated by PAC but with a crucial difference. The distance metric, $d$, is not necessarily related to the distribution according to which data is drawn (in the passive case). This prevents us from using traditional uniform convergence (Vapnik, 1982) type arguments to prove learnability. The problem of learning under a different metric is an interesting one and merits further investigation in its own right.
Figure 12: An arbitrary data set for the case of functions with a bounded derivative. The functions in $\mathcal{F}_{\mathcal{D}}$ are constrained to lie in the parallelograms as shown. The slopes of the lines making up the parallelogram are $d$ and $-d$ appropriately.


Figure 15: A figure to help the visualization of Lemma 4. For the $x$ shown, the set $\mathcal{F}_{\mathcal{D}}(x)$ is the set of all values which lie within the parallelogram corresponding to this $x$, i.e., on the vertical line drawn at $x$ but within the parallelogram.


Figure 13: A zoomed version of the $i$th interval.


Figure 14: Subdivision of the $i$th interval when a new data point is obtained.


Figure 17: How CLA-2 chooses to sample its points. Vertical lines have been drawn at the $x$ values where CLA-2 queried the oracle for the corresponding function value.